BP-203 Foundations for Mathematical Biology Statistics Lecture III By Hao Li Nov 8, 2001
Statistical Modeling and Inference
data collection
constructing a probabilistic model
inference of model parameters
interpreting results
making new predictions
Maximum likelihood approach
Example A: Toss a coin N times, observe m heads in a specific sequence
Model: binomial distribution
Inference: the parameter p
Prediction: e.g., how many heads will be observed in another L trials
Probability of observing a specific sequence with m heads:
$$P(m \mid p) = p^{m} (1-p)^{N-m}$$
Find the p that maximizes this probability:
$$\left.\frac{\partial \log P(m \mid p)}{\partial p}\right|_{\hat p} = 0 \quad\Rightarrow\quad \hat p = m/N$$
$$\log P(m \mid \hat p) = N\left[\hat p \log \hat p + (1-\hat p)\log(1-\hat p)\right]$$
The term in brackets is minus the entropy of a coin with parameter $\hat p$.
How good is the estimate?
Consider the distribution of $\hat p$ under repeated sampling
Central limit theorem → the distribution of m approaches normal for large N
$$m \sim Np \pm \sqrt{Np(1-p)}, \qquad \hat p \sim p \pm \sqrt{p(1-p)/N}$$
Thus the estimate converges to the real p with a square-root convergence
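The estimate and its square-root error bar can be sketched in a few lines. This is a minimal illustration, not part of the lecture: the function names and the data (30 heads in 100 tosses) are hypothetical, and the standard error plugs $\hat p$ in for the unknown p.

```python
import math

def coin_mle(m, n):
    """Return (p_hat, standard error) for m heads observed in n tosses."""
    p_hat = m / n
    # Plug-in version of sqrt(p(1-p)/N), using p_hat in place of the unknown p.
    std_err = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat, std_err

def log_likelihood(p, m, n):
    """log P(m | p) = m log p + (N - m) log(1 - p), valid for 0 < p < 1."""
    return m * math.log(p) + (n - m) * math.log(1.0 - p)

# Hypothetical data: 30 heads in 100 tosses.
p_hat, std_err = coin_mle(30, 100)
```

Evaluating `log_likelihood` slightly above and below `p_hat` is a quick way to check that m/N really is the maximum.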
Maximum likelihood approach
Example B: $x_1, x_2, \ldots, x_N$ is an independent and identically distributed (i.i.d.) sample drawn from a normal distribution $N(\mu, \sigma^2)$
Estimate the mean and the variance by maximizing the likelihood function (show this is true in the homework):
$$\hat\mu = \bar x = \sum_{i=1}^{N} x_i / N$$
$$\hat\sigma^2 = \sum_{i=1}^{N} (x_i - \bar x)^2 / N$$
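As a small numerical sketch of the two estimators (the function name and the sample are hypothetical), note that the maximum likelihood variance divides by N, not by N - 1 as the sample variance does:

```python
def normal_mle(xs):
    """Maximum likelihood estimates of the mean and variance of a normal sample."""
    n = len(xs)
    mu_hat = sum(xs) / n
    # The MLE divides by N; the unbiased sample variance would divide by N - 1.
    var_hat = sum((x - mu_hat) ** 2 for x in xs) / n
    return mu_hat, var_hat

# Hypothetical sample.
mu_hat, var_hat = normal_mle([2.0, 4.0, 4.0, 6.0])
```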
General formulation of the maximum likelihood approach
D: observed data
M: the statistical model
θ: parameters of the model
$P(D \mid M, \theta)$: probability of observing the data given the model and parameters
$$L(\theta; D) \equiv P(D \mid M, \theta)$$
the likelihood of θ as a function of data
Maximum likelihood estimate of the parameters:
$$\hat\theta = \arg\max_{\theta} L(\theta; D)$$
Theorem: $\hat\theta$ converges to the true $\theta_0$ in the large sample limit, with error $\sim 1/\sqrt{N}$
Example C: Segmentation
A sequence of heads (1) and tails (0) is generated by first using a coin with $p_1$ and then changing to a coin with $p_2$; the change point is unknown
Data = (001010000000010111101111100010)
$$P(seq, x \mid p_1, p_2) = p_1^{m_1(x)} (1-p_1)^{x-m_1(x)}\; p_2^{m_2(x)} (1-p_2)^{N-x-m_2(x)}$$
x: position right before the change
$m_1(x)$: number of 1's up to x
$m_2(x)$: number of 1's after x
N: total number of tosses
Example C continued
For fixed x, maximize $P(seq, x \mid p_1, p_2)$ with respect to $p_1$ and $p_2$:
$$\hat p_1 = m_1(x)/x, \qquad \hat p_2 = m_2(x)/(N-x)$$
$$\log P(seq, x \mid \hat p_1, \hat p_2) = x\left[\hat p_1 \log \hat p_1 + (1-\hat p_1)\log(1-\hat p_1)\right] + (N-x)\left[\hat p_2 \log \hat p_2 + (1-\hat p_2)\log(1-\hat p_2)\right]$$
Then maximize $P(seq, x \mid \hat p_1, \hat p_2)$ with respect to x
The above approach is sometimes referred to as "entropic segmentation", as it tries to minimize the total entropy
A generalization of the above model to a 4-letter alphabet and an unknown number of break points can be used to segment DNA sequences into regions of different composition. This is more naturally described by a hidden Markov model.
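The two-stage maximization can be sketched as a scan over all change points. This is an illustrative sketch, not the lecture's own code: the helper names are made up, x runs over 1..N-1 so that both segments are non-empty, and $0 \log 0$ is taken as 0.

```python
import math

def neg_entropy(p):
    """p*log(p) + (1-p)*log(1-p), i.e. minus the entropy, with 0*log(0) = 0."""
    return sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0.0)

def segment(seq):
    """Return (x, log-likelihood) for the change point x that maximizes
    log P(seq, x | p1_hat, p2_hat); seq is a string of '0' and '1'."""
    n = len(seq)
    best = None
    for x in range(1, n):                  # both segments must be non-empty
        m1 = seq[:x].count("1")            # number of 1's up to x
        m2 = seq[x:].count("1")            # number of 1's after x
        ll = x * neg_entropy(m1 / x) + (n - x) * neg_entropy(m2 / (n - x))
        if best is None or ll > best[1]:
            best = (x, ll)
    return best

data = "001010000000010111101111100010"    # the sequence from Example C
x_hat, ll_hat = segment(data)
```

Restricting the `range` in `segment` to a known window is enough to handle the case where the change point is known to lie in some interval.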
Example D: detecting weak common sequence patterns in a set of related sequences
e.g., local sequence motifs for functionally or structurally related proteins (no overall sequence similarity)
regulatory elements in the upstream regions of co-regulated genes, which could be genes clustered together by microarray data
The simplest situation: each sequence contains one realization of the motif with given length, but the starting positions are unknown
Example: 22 genes identified as pho4 target by microarray, O’shea lab YAR071W:600:-600 \catcaagatgagaaaataaagggattttttcgttcttttatcattttctctttctcacttccgactacttcttatatctactttcatcgtttcattcatcgtgggtgtctaataaagtttta atgacagagataaccttgataagctttttcttatacgctgtgtcacgtatttattaaattaccacgttttcgcataacattctgtagttcatgtgtactaaaaaaaaaaaaaaaaaa gaaataggaaggaaagagtaaaaagttaatagaaaacagaacacatccctaaacgaagccgcacaatcttggcgttcacacgtgggtttaaaaaggcaaattacacag aatttcagaccctgtttaccggagagattccatattccgcacgtcacattgccaaattggtcatctcaccagatatgttatacccgttttggaatgagcataaacagcgtcgaa ttgccaagtaaaacgtatataagctcttacatttcgatagattcaagctcagtttcgccttggttgtaaagtaggaagaagaagaagaagaagaggaacaacaacagcaaa gagagcaagaacatcatcagaaatacca\ YBR092C:600:-600 \aatcaatgacttctacgactatgctgaaaagagagtagccggtactgacttcctaaaggtctgtaacgtcagcagcgtcagtaactctactgaattgaccttctactgggac tggaacactactcattacaacgccagtctattgagacaatagttttgtataactaaataatattggaaactaaatacgaatacccaaattttttatctaaattttgccgaaagatta aaatctgcagagatatccgaaacaggtaaatggatgtttcaatccctgtagtcagtcaggaacccatattatattacagtattagtcgccgcttaggcacgcctttaattagca aaatcaaaccttaagtgcatatgccgtataagggaaactcaaagaactggcatcgcaaaaatgaaaaaaaggaagagtgaaaaaaaaaaaattcaaaagaaatttacta aataataccagtttgggaaatagtaaacagctttgagtagtcctatgcaacatatataagtgcttaaatttgctggatggaagtcaattatgccttgattatcataaaaaaaata ctacagtaaagaaagggccattccaaattacct\ YBR093C:600:-600 \cgctaatagcggcgtgtcgcacgctctctttacaggacgccggagaccggcattacaaggatccgaaagttgtattcaacaagaatgcgcaaatatgtcaacgtatttgg aagtcatcttatgtgcgctgctttaatgttttctcatgtaagcggacgtcgtctataaacttcaaacgaaggtaaaaggttcatagcgctttttctttgtctgcacaaagaaatata tattaaattagcacgttttcgcatagaacgcaactgcacaatgccaaaaaaagtaaaagtgattaaaagagttaattgaataggcaatctctaaatgaatcgatacaaccttg gcactcacacgtgggactagcacagactaaatttatgattctggtccctgttttcgaagagatcgcacatgccaaattatcaaattggtcaccttacttggcaaggcatatac ccatttgggataagggtaaacatctttgaattgtcgaaatgaaacgtatataagcgctgatgttttgctaagtcgaggttagtatggcttcatctctcatgagaataagaacaa caacaaatagagcaagcaaattcgagattacca\ YBR296C:600:-600 
\gaaatctcggtttcacccgcaaaaaagtttaaatttcacagatcgcgccacaccgatcacaaaacggcttcaccacaagggtgtgtggctgtgcgatagaccttttttttctt tttctgctttttcgtcatccccacgttgtgccattaatttgttagtgggcccttaaatgtcgaaatattgctaaaaattggcccgagtcattgaaaggctttaagaatataccgtac aaaggagtttatgtaatcttaataaattgcatatgacaatgcagcacgtgggagacaaatagtaataatactaatctatcaatactagatgtcacagccactttggatccttcta ttatgtaaatcattagattaactcagtcaatagcagattttttttacaatgtctactgggtggacatctccaaacaattcatgtcactaagcccggttttcgatatgaagaaaattat atataaacctgctgaagatgatctttacattgaggttattttacatgaattgtcatagaatgagtgacatagatcaaaggtgagaatactggagcgtatctaatcgaatcaatat aaacaaagattaagcaaaaatg\
A model for the motif
Aligned sites:
AAATGA
AGGTCC
AGGATG
AGACGT

Alignment matrix
position   A   C   G   T
1          4   0   0   0
2          1   0   3   0
3          2   0   2   0
4          1   1   0   2
5          0   1   2   1
6          1   1   1   1

Position specific probability matrix
position   A      C      G      T
1          1.00   0.00   0.00   0.00
2          0.25   0.00   0.75   0.00
3          0.50   0.00   0.50   0.00
4          0.25   0.25   0.00   0.50
5          0.00   0.25   0.50   0.25
6          0.25   0.25   0.25   0.25

Model:
the probability of observing base σ at position i inside the motif is $f_{i,\sigma}$, given by the above matrix
the probability of observing base σ outside the motif is given by the background frequency $f^0_\sigma$
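The step from aligned sites to the two matrices above can be sketched as follows. The function names are illustrative only; the four sites are the ones shown on the slide.

```python
from collections import Counter

BASES = "ACGT"

def count_matrix(sites):
    """One Counter of base counts per motif position (the alignment matrix)."""
    width = len(sites[0])
    return [Counter(site[j] for site in sites) for j in range(width)]

def probability_matrix(sites):
    """Position specific probability matrix: counts divided by the number of sites."""
    n = len(sites)
    return [{b: col[b] / n for b in BASES} for col in count_matrix(sites)]

sites = ["AAATGA", "AGGTCC", "AGGATG", "AGACGT"]
pspm = probability_matrix(sites)
```

Each entry of `pspm` is one row of the position specific probability matrix, keyed by base.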
Starting positions of the motif unknown: $\vec x = (x_1, x_2, \ldots, x_N)$
Position specific probability matrix $f_{i,\sigma}$ unknown; it needs to be inferred from the observed sequence data
$$P(seq, \vec x \mid f_{i,\sigma}) = \prod_{i=1}^{N} \left[\, \prod_{j=1}^{x_i-1} f^0_{\sigma_{ij}} \prod_{j=x_i}^{x_i+w-1} f_{j-x_i+1,\,\sigma_{ij}} \prod_{j=x_i+w}^{L} f^0_{\sigma_{ij}} \right]$$
N: number of sequences
L: length of the sequence
w: width of the motif
$\sigma_{ij}$: base of sequence i at position j
$$P(seq, \vec x \mid f_{i,\sigma}) = const \times \prod_{j=1}^{w} \prod_{\sigma} \left( \frac{f_{j,\sigma}}{f^0_\sigma} \right)^{n_{j,\sigma}(\vec x)}$$
The ratio $f_{j,\sigma}/f^0_\sigma$ is the likelihood ratio
$n_{j,\sigma}(\vec x)$: total count for base σ at position j in the alignment
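Maximizing $P(seq, \vec x \mid f_{i,\sigma})$ w.r.t. $f_{i,\sigma}$ with $\vec x$ fixed:
$$\hat f_{j,\sigma} = \frac{n_{j,\sigma}(\vec x)}{\sum_{\sigma} n_{j,\sigma}(\vec x)}$$
$$\log P(seq, \vec x \mid \hat f_{i,\sigma}) = const + N \sum_{j=1}^{w} \sum_{\sigma} \hat f_{j,\sigma} \log \frac{\hat f_{j,\sigma}}{f^0_\sigma}$$
The sum is a log likelihood ratio, i.e. the relative entropy between the motif frequencies and the background frequencies
In reality, this formula is modified by adding pseudo counts due to Bayesian estimate
Then maximize the above relative entropy w.r.t. $\vec x$ → alignment path.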
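For a given choice of starting positions, the relative entropy score can be computed directly from the aligned sites. The sketch below is an assumption-laden illustration: it uses natural logarithms, no pseudo counts, and a background supplied by the caller, so its numbers need not match the "information" value reported by any particular program.

```python
import math
from collections import Counter

def alignment_score(sites, background):
    """N * sum_j sum_sigma f_hat * log(f_hat / f0) for a set of aligned sites.

    sites: equal-length strings, one motif occurrence per sequence.
    background: dict mapping each base to its background frequency f0.
    """
    n = len(sites)
    score = 0.0
    for j in range(len(sites[0])):
        counts = Counter(site[j] for site in sites)
        for base, c in counts.items():     # bases with zero count contribute 0
            f_hat = c / n
            score += n * f_hat * math.log(f_hat / background[base])
    return score

uniform = {b: 0.25 for b in "ACGT"}        # hypothetical uniform background
score = alignment_score(["AAATGA", "AGGTCC", "AGGATG", "AGACGT"], uniform)
```

Searching for the motif then amounts to finding the set of starting positions whose sites maximize this score.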
Stormo-Hartzell Algorithm: Consensus
Each of the length-w substrings of the first sequence is aligned against all the substrings of the same length in the second sequence; matrices are derived, and the N top matrices with the highest information content are saved
The next sequence on the list is added to the analysis; all the matrices saved previously are paired with the substrings of the added sequence, and the top N matrices are saved
Repeat the previous step until all the sequences have been processed
Consensus output for Pho4 regulated genes
MATRIX 1
number of sequences = 22
information = 8.80903
ln(p-value) = -153.757   p-value = 1.67566E-67
ln(expected frequency) = -13.357   expected frequency = 1.58165E-06
A|   6   5  20   3   0   3   0   0   0   6
G|  11   0   0   5  22   0  21  15  14   2
C|   4  17   0  14   0   0   1   2   8   1
T|   1   0   2   0   0  19   0   5   0  13
     G   C   A   C   G   T   G   G   G   T
1|1 : 1/317 ACACGTGGGT
2|2 : 2/55 AAAGGTCTGT
3|3 : 3/347 ACACGTGGGA
4|4 : 4/274 GCACGTGGGA
5|5 : 5/392 CAACGTGTCT
6|6 : 6/395 ACAAGTGGGT
7|7 : 7/321 ACACGTGGGA
8|8 : 8/536 GCAAGTGGCT
9|9 : 9/177 GCTGGTGTGT
10|10 : 10/443 GCACGTGTCT
11|11 : 11/14 CCAGGTGCCT
12|12 : 12/502 GAAAGAGGCA
13|13 : 13/354 GCACGAGGGA
14|14 : 14/257 GCACGTGCGA
15|15 : 15/358 TCACGTGTGT
16|16 : 16/316 ACACGTGGGT
17|17 : 17/479 GCACGTGGCT
18|18 : 18/227 GATGGTGGCT
19|19 : 19/186 GCACGTGGGG
20|20 : 20/326 GAAGGAGGGG
21|21 : 21/307 CCACGTGGGC
22|22 : 22/255 CCACGTGGCT
Maximum likelihood estimate with missing data
General formulation: Expectation and Maximization (EM) algorithm
Missing data:
in example C, the point where the coin is changed
in example D, the starting positions of the motif
In the maximum likelihood approach, there is a crucial distinction between parameters (population), such as the position specific probability matrix, and the missing data, since missing data grow with the sample size and in general cannot be recovered precisely even if the sample size goes to infinity
For many problems, it is necessary to sum over all missing data:
$$L(x; \theta) = \sum_{y} P(x, y \mid \theta)$$
where x is the observed data and y is the missing data
To estimate the parameters, one maximizes the likelihood function $L(x; \theta)$; however, it is often difficult to perform the summation over missing data explicitly
Expectation Maximization (EM) algorithm
Improve the estimate of the parameters iteratively: given an estimate $\theta^t$, find $\theta^{t+1}$ that increases the likelihood function
E step: calculate the Q function, the expectation of $\log P(x, y \mid \theta)$ over missing data with probabilities given by the current parameters
$$Q(\theta \mid \theta^t) \equiv \sum_{y} P(y \mid x, \theta^t) \log P(x, y \mid \theta)$$
M step: maximize the Q function to get a new estimate
$$\theta^{t+1} = \arg\max_{\theta} Q(\theta \mid \theta^t)$$
That the EM algorithm never decreases the likelihood function can be proved by the following equation and inequality:
$$\log P(x \mid \theta) - \log P(x \mid \theta^t) = Q(\theta \mid \theta^t) - Q(\theta^t \mid \theta^t) + \sum_{y} P(y \mid x, \theta^t) \log \frac{P(y \mid x, \theta^t)}{P(y \mid x, \theta)}$$
$$\log P(x \mid \theta) - \log P(x \mid \theta^t) \ge Q(\theta \mid \theta^t) - Q(\theta^t \mid \theta^t)$$
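The step behind the equation and the inequality can be filled in as follows. This is a standard derivation written in the notation of the slide, added here only as a worked intermediate step.

```latex
% Start from P(x, y | \theta) = P(y | x, \theta) P(x | \theta):
\log P(x \mid \theta) = \log P(x, y \mid \theta) - \log P(y \mid x, \theta)
% Multiply by P(y | x, \theta^t) and sum over y; the left side does not depend on y:
\log P(x \mid \theta) = Q(\theta \mid \theta^t) - \sum_y P(y \mid x, \theta^t) \log P(y \mid x, \theta)
% Subtract the same identity evaluated at \theta = \theta^t:
\log P(x \mid \theta) - \log P(x \mid \theta^t)
  = Q(\theta \mid \theta^t) - Q(\theta^t \mid \theta^t)
  + \sum_y P(y \mid x, \theta^t) \log \frac{P(y \mid x, \theta^t)}{P(y \mid x, \theta)}
% The last sum is a relative entropy, which is never negative, so dropping it gives the inequality.
```

Choosing $\theta^{t+1}$ to maximize Q makes the right side of the inequality non-negative, so the likelihood cannot go down.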
Example E: identifying combinatorial motifs by word segmentation
[Figure: a set of regulatory sequences containing motif1, motif2 and motif3]
How do we find these motifs?
chapterptgpbqdrftezptqtasctmvivwpecjsnisrmbtqlmlfvetl loomingsfkicallxjgkmekysjerishmaeljplfsomeylqyearstvh njbagoaxhjtjcokhvneverpmqpmindhowzrbdlzjllonggbhqi preciselysunpvskepfdjktcgarwtnxybgcvdjfbnohavinglittl ezorunozsoyapmoneyyvugsgtsqintmyteixpurseiwfmjwgj nyyveqxwftlamnbxkrsbkyandrnothingcgparticularwtzao qsjtnmtoqsnwvxfiupinterestztimebymonlnshoreggditho ughtyxfxmhqixceojjzdhwouldsailpcaboutudxsbsnewtpg gvjaasxmsvlittleplvcydaowgwlbzizjlnzyxandzolwcudthjd osbopxkkfdosxardgcseebbthefzrsskdhmawateryjikzicim ypartmofprtheluworldvtoamfutitazpisagwewayrqbkiosh avebojwphiixofprmalungipjdrivingpkuyoikrwxoffodhicb nimtheixyucpdzacemspleenqbpcrmhwvddyaiwnandada bkpgzmptoregulatingeetheslcirculationvsuctzwvfyxstuzr dfwvgygzoejdfmbqescwheneverpitfindfmyselfcgrowingne ostumrydrrthmjsmgrimcczhjmgbkwczoaboutjbwanbwzq thehrjvdrcjjgmouthuutwheneveritddfouishlawwphxnae
Bussemaker/Li/Siggia Model: Probabilistic Segmentation/Maximum likelihood
A probabilistic dictionary: words and their probabilities
A → $P_A$
C → $P_C$
G → $P_G$
T → $P_T$
GC → $P_{GC}$
TATAA → $P_{TATAA}$
The word boundaries are missing; the sequence AGTATAAGC can be segmented in several ways:
A|G|T|A|T|A|A|G|C
A|G|TATAA|GC
A|G|TATAA|G|C
$$Z = \sum_{Seg} P_{w_1} P_{w_2} P_{w_3} \cdots P_{w_n}$$
The dictionary is determined by maximizing the likelihood function
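To make the sum over segmentations concrete, the sketch below enumerates every segmentation of the example sequence by brute force and adds up the products of word probabilities. The probabilities in `dictionary` are made-up placeholders, and the enumeration is only practical for short sequences; longer ones need a dynamic programming recursion.

```python
def segmentations(seq, dictionary):
    """Yield every way to split seq into consecutive dictionary words."""
    if not seq:
        yield []
        return
    for word in dictionary:
        if seq.startswith(word):
            for rest in segmentations(seq[len(word):], dictionary):
                yield [word] + rest

def partition_sum(seq, dictionary):
    """Z = sum over segmentations of the product of word probabilities."""
    total = 0.0
    for seg in segmentations(seq, dictionary):
        weight = 1.0
        for w in seg:
            weight *= dictionary[w]
        total += weight
    return total

# Hypothetical word probabilities (they sum to 1).
dictionary = {"A": 0.2, "C": 0.2, "G": 0.2, "T": 0.2, "GC": 0.1, "TATAA": 0.1}
segs = list(segmentations("AGTATAAGC", dictionary))
z = partition_sum("AGTATAAGC", dictionary)
```

With this dictionary there are four segmentations, since TATAA and GC can each be read either as one word or letter by letter.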
Dictionary Construction
Parameter inference: given the entries in the dictionary, find $P_W$ by maximizing the likelihood function. Start with a simple dictionary of all possible words
Model improvement: do statistical tests on longer words based on the current dictionary, add the ones that are over-represented, and re-assign $P_W$ by maximizing the likelihood function
Iterate the above
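EM algorithm for the word segmentation
$N_w(Seg)$: number of occurrences of word w in a given segmentation
$$L(\{p_w\}; seq) = \sum_{Seg} \prod_{w} (p_w)^{N_w(Seg)}$$
E step:
$$Q(\{p_w\} \mid \{p_w(t)\}) = \sum_{w} \langle N_w \rangle_t \log p_w$$
M step:
$$p_w(t+1) = \frac{\langle N_w \rangle_t}{\sum_{w} \langle N_w \rangle_t}$$
Here $\langle N_w \rangle_t$ is the average of $N_w(Seg)$ over segmentations, weighted with the current probabilities $p_w(t)$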
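One way to carry out the E and M steps without enumerating segmentations is a forward-backward recursion over positions. The sketch below is an illustration under stated assumptions, not the published implementation: the function names, the test sequence, and the starting probabilities are hypothetical, and plain floating point is used, so very long sequences would need rescaling to avoid underflow.

```python
def forward(seq, p):
    """f[i] = total weight of all segmentations of the prefix seq[:i]."""
    f = [0.0] * (len(seq) + 1)
    f[0] = 1.0
    for i in range(1, len(seq) + 1):
        for w, pw in p.items():
            if i >= len(w) and seq[i - len(w):i] == w:
                f[i] += f[i - len(w)] * pw
    return f

def backward(seq, p):
    """b[i] = total weight of all segmentations of the suffix seq[i:]."""
    n = len(seq)
    b = [0.0] * (n + 1)
    b[n] = 1.0
    for i in range(n - 1, -1, -1):
        for w, pw in p.items():
            if seq.startswith(w, i):
                b[i] += pw * b[i + len(w)]
    return b

def em_step(seq, p):
    """One EM update. Returns (new probabilities, likelihood Z under the old p)."""
    f, b = forward(seq, p), backward(seq, p)
    z = f[len(seq)]
    expected = {w: 0.0 for w in p}          # E step: expected word counts <N_w>
    for i in range(len(seq)):
        for w, pw in p.items():
            if seq.startswith(w, i):
                expected[w] += f[i] * pw * b[i + len(w)] / z
    total = sum(expected.values())
    return {w: c / total for w, c in expected.items()}, z   # M step

# Hypothetical sequence and starting dictionary.
seq = "AGTATAAGCTATAAGC"
p = {"A": 0.2, "C": 0.2, "G": 0.2, "T": 0.2, "GC": 0.1, "TATAA": 0.1}
likelihoods = []
for _ in range(5):
    p, z = em_step(seq, p)
    likelihoods.append(z)
```

The values in `likelihoods` should never decrease from one iteration to the next, which is a convenient sanity check on an implementation.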
Dictionary1
words: e t a o n i s h r l d u m g w c f y p b v k j q z x
probabilities: 0.065239 0.055658 0.052555 0.050341 0.049266 0.048101 0.047616 0.047166 0.043287 0.041274 0.039461 0.034742 0.034349 0.034001 0.033967 0.032934 0.032597 0.031776 0.031711 0.031409 0.028268 0.028113 0.026712 0.026561 0.026542 0.026357

Dictionary2
words: e s a t i d o l g r c m n y p f b w h v k u j z x q th in er an ou on he at ed or en to of st nd
probabilities: 0.048730 0.042589 0.040539 0.040442 0.038550 0.038547 0.036486 0.036300 0.034509 0.034496 0.033916 0.033724 0.033321 0.033227 0.033156 0.032863 0.032780 0.032009 0.031494 0.030727 0.030445 0.030379 0.029268 0.028905 0.028404 0.028123 0.009954 0.006408 0.004755 0.004352 0.003225 0.003180 0.003108 0.002851 0.002804 0.002786 0.002538 0.002511 0.002475 0.002415 0.002297

Dictionary3
words: e s a i t d l c m y b r p w n g f o h k v j u z x q the ing and in ed to of en an th er es at it that
probabilities: 0.042774 0.040843 0.038595 0.036897 0.036871 0.036323 0.035336 0.034818 0.034650 0.034482 0.034396 0.034105 0.034044 0.033819 0.033817 0.033676 0.033534 0.033206 0.033200 0.032103 0.031498 0.031209 0.031186 0.031003 0.030544 0.030244 0.005715 0.003237 0.003128 0.002968 0.002547 0.002496 0.002486 0.001331 0.001313 0.001270 0.001250 0.001209 0.001181 0.001171 0.001165
Words
quality factor
--------------------------------------------------------------------------------
abominate achieved aemploy affrighted afternoon afterwards ahollow american anxious apartment appeared astonishment attention avenues bashful battery beefsteaks believe beloved beneath between boisterous botherwise bountiful bowsprit breakfast breeding bulkington bulwarksb bumpkin business carpenters
2.0000 2.0000 2.0000 2.0000 2.0000 5.0000 2.0000 3.0000 2.0000 2.0000 4.0000 4.0000 2.0000 2.0000 2.0000 2.0000 2.0000 2.0000 2.0000 6.0000 12.0000 3.0000 2.0000 2.0000 2.0000 5.0000 2.0000 3.0000 2.0000 2.0000 6.0000 2.0000
1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
Table 1. Known cell cycle sites and some metabolic sites that match words from our genomewide dictionary
Site    Known consensus       Matching dictionary words
MCB     ACGCGT                AAACGCGT, ACGCGTCGCGT, CGCGACGCGT, TGACGCGT
SCB     CRCGAAA               ACGCGAAA
SCB'    ACRMSAAA              ACGCGAAA, ACGCCAAA, AACGCCAA
Swi5    RRCCAGCR              GCCAGCG, GCAGCCAG
SIC1    GCSCRGC               GCCCAGCC, CCGCGCGG
MCM1    TTWCCYAAWNNGGWAA      TTTCCNNNNNNGGAAA
NIT     GATAAT                TGATAATG
MET     TCACGTG               RTCACGTG, TCACGTGM, CACGTGAC, CACGTGCT
PDR     TCCGCGGA              TCCGCGG
HAP     CCAAY                 AACCCAAC
MIG1    KANWWWWATSYGGGGW      TATATGTG, CATATATG, GTGGGGAG
GAL4    CGGN11CCG             CGGN11CCG
Our dictionary vs. known TF binding sites
Yeast promoter database: 443 non-redundant sites (Zhu and Zhang, Cold Spring Harbor)
                       # of matches   Expected (standard deviation)
Our dictionary         114            25 (4.8)
Scrambled dictionary   33             14 (3.3)
Brazma et al.          30             9 (2.9)
BP-203 Foundations for Mathematical Biology Statistics Lecture IV By Hao Li Nov 13, 2001
Bayesian Inference
Example A revisited: Toss a coin N times, observe m heads; N is small
Model: binomial distribution
e.g., N=2, m=0: would you infer p = m / N = 0?
There is a wide range of p that can produce the observed result
Our prior knowledge → p = 0 unlikely
Bayesian Inference:
infer a probability distribution of the parameters conditioned on the observed data
input our prior knowledge of the distribution of the parameters
Bayes' theorem
Let A and $B_1, B_2, \ldots, B_n$ be events where the $B_i$ are disjoint and exhaustive (cover the whole sample space), and $P(B_i) > 0$ for all i. Then
$$P(B_j \mid A) = \frac{P(A \mid B_j)\, P(B_j)}{\sum_{i=1}^{n} P(A \mid B_i)\, P(B_i)}$$
$P(B_j)$: prior knowledge of $B_j$
$P(B_j \mid A)$: distribution of $B_j$ conditioned on A
Example F: polygraph tests/lie detector tests (discrete sample space)
Events:
L: subject is lying
T: subject is telling the truth
+: polygraph reading is positive (indicating that the subject is lying)
-: polygraph reading is negative (indicating that the subject is telling the truth)
Polygraph reliability → conditional probability (conditioned on L and T)
        L      T
+       0.88   0.14
-       0.12   0.86
One specific screen, prior P(T) = 0.99, P(L) = 0.01
What is P(T|+), the probability that the subject is telling the truth given that the reading is positive?
$$P(T \mid +) = \frac{P(+ \mid T)\, P(T)}{P(+ \mid T)\, P(T) + P(+ \mid L)\, P(L)} = \frac{0.14 \times 0.99}{0.14 \times 0.99 + 0.88 \times 0.01} = 0.94$$
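The arithmetic above is a direct application of Bayes' theorem with two hypotheses. A minimal sketch, with an illustrative function name:

```python
def posterior_truth_given_positive(p_pos_given_t, p_pos_given_l, prior_t):
    """P(T | +) for a two-hypothesis problem (T: truthful, L: lying)."""
    prior_l = 1.0 - prior_t
    numerator = p_pos_given_t * prior_t
    return numerator / (numerator + p_pos_given_l * prior_l)

# Numbers from the slide: P(+|T) = 0.14, P(+|L) = 0.88, P(T) = 0.99.
p = posterior_truth_given_positive(0.14, 0.88, 0.99)
```

Because liars are rare under this prior, most positive readings come from truthful subjects even though the test is fairly reliable.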
Bayesian inference for continuous parameters
$$P(\theta \mid Data) = \frac{P(Data \mid \theta)\, \pi(\theta)}{\int P(Data \mid \theta)\, \pi(\theta)\, d\theta}$$
$\pi(\theta)$: prior distribution
$P(\theta \mid Data)$: posterior distribution
Example A revisited
$$P(m \mid p) = p^{m} (1-p)^{N-m}$$
posterior distribution:
$$P(p \mid m) = \frac{P(m \mid p)\, \pi(p)}{\int P(m \mid p)\, \pi(p)\, dp}$$
conjugate prior: same functional form as the likelihood function (Beta distribution)
$$\pi(p) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, p^{\alpha-1} (1-p)^{\beta-1}$$
$$P(p \mid m) = const \times p^{m+\alpha-1} (1-p)^{N-m+\beta-1}$$
$$B(a, b) \equiv \int_0^1 x^{a-1} (1-x)^{b-1}\, dx = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}, \qquad \Gamma(x) = (x-1)\Gamma(x-1)$$
average over the posterior distribution:
$$\bar p = \int p\, P(p \mid m)\, dp = \frac{\int p^{m+\alpha} (1-p)^{N-m+\beta-1}\, dp}{\int p^{m+\alpha-1} (1-p)^{N-m+\beta-1}\, dp} = \frac{m+\alpha}{N+\alpha+\beta}$$
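For the small-sample case raised at the start of this lecture (N = 2, m = 0), the posterior mean can be evaluated directly. The sketch assumes a uniform prior, α = β = 1, which is a choice made here for illustration and not one fixed by the slide.

```python
def posterior_mean(m, n, alpha=1.0, beta=1.0):
    """Mean of the Beta posterior for m heads in n tosses: (m + alpha) / (n + alpha + beta)."""
    return (m + alpha) / (n + alpha + beta)

# N = 2, m = 0 with a uniform prior (alpha = beta = 1, an assumption).
p_bar = posterior_mean(0, 2)
```

The maximum likelihood estimate for the same data is m/N = 0, while the posterior mean stays away from zero.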
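The prior parameters α and β enter the posterior mean as pseudo counts
Generalize to an alphabet of T letters: multinomial likelihood, Dirichlet prior
$$P(\vec m \mid \vec p) = \prod_{i=1}^{T} p_i^{m_i}, \qquad \sum_{i=1}^{T} m_i = N, \qquad \sum_{i=1}^{T} p_i = 1$$
$$\pi(\vec p) = \prod_{i=1}^{T} p_i^{\alpha_i - 1}$$
$$\bar p_i = \frac{m_i + \alpha_i}{\sum_{i=1}^{T} (m_i + \alpha_i)}$$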
Example C: Segmentation Revisited
A sequence of heads (1) and tails (0) is generated by first using a coin with $p_1$ and then changing to a coin with $p_2$; the change point is unknown
Data = (001010000000010111101111100010)
Infer the distribution of parameters as well as missing data
$$P(seq \mid p_1, p_2, x) = p_1^{h_1(x)} (1-p_1)^{t_1(x)}\; p_2^{h_2(x)} (1-p_2)^{t_2(x)}$$
x: position right before the change
$h_1(x)$: number of heads up to x
$t_1(x)$: number of tails up to x
$h_2(x)$: number of heads after x
$t_2(x)$: number of tails after x
N: total number of tosses
In the Bayesian approach, we treat the parameters and the missing data on the same footing
posterior distribution:
$$P(p_1, p_2, x \mid seq) = \frac{P(seq \mid p_1, p_2, x)\, \pi(p_1, p_2, x)}{\sum_{x} \iint dp_1\, dp_2\, P(seq \mid p_1, p_2, x)\, \pi(p_1, p_2, x)}$$
prior (uniform on x):
$$\pi(p_1, p_2, x) = p_1^{\alpha_1 - 1} (1-p_1)^{\beta_1 - 1}\, p_2^{\alpha_2 - 1} (1-p_2)^{\beta_2 - 1} / (N+1)$$
posterior distribution of x:
$$P(x \mid seq) = \iint dp_1\, dp_2\, P(p_1, p_2, x \mid seq)$$
$$P(x \mid seq) = \frac{B(h_1(x)+\alpha_1,\, t_1(x)+\beta_1)\; B(h_2(x)+\alpha_2,\, t_2(x)+\beta_2)}{\sum_{x'} B(h_1(x')+\alpha_1,\, t_1(x')+\beta_1)\; B(h_2(x')+\alpha_2,\, t_2(x')+\beta_2)}$$
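The posterior over the change point can be evaluated numerically from the Beta functions. The sketch below assumes uniform Beta priors (all α and β equal to 1) unless told otherwise, lets x run over 0..N to match the 1/(N+1) prior, and works with logarithms of Beta functions to avoid underflow; the function names are illustrative.

```python
import math

def log_beta(a, b):
    """log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a + b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def change_point_posterior(seq, a1=1.0, b1=1.0, a2=1.0, b2=1.0):
    """Return [P(x | seq) for x = 0..N]; seq is a string of '0' and '1'."""
    n = len(seq)
    log_w = []
    for x in range(n + 1):
        h1 = seq[:x].count("1")
        t1 = x - h1
        h2 = seq[x:].count("1")
        t2 = (n - x) - h2
        log_w.append(log_beta(h1 + a1, t1 + b1) + log_beta(h2 + a2, t2 + b2))
    top = max(log_w)                        # subtract the maximum before exponentiating
    w = [math.exp(v - top) for v in log_w]
    z = sum(w)
    return [v / z for v in w]

posterior = change_point_posterior("001010000000010111101111100010")
x_map = max(range(len(posterior)), key=posterior.__getitem__)
```

Unlike the maximum likelihood scan, this returns a full distribution over x, so the uncertainty in the change point is visible directly.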
Problem sets
1. Suppose that $x_1, x_2, \ldots, x_N$ are an independent and identically distributed (i.i.d.) sample drawn from a normal distribution $N(\mu, \sigma^2)$.
a) Show that the maximum likelihood estimates (MLE) for the mean and variance are given by the following formulas:
$$\hat\mu = \bar x = \sum_{i=1}^{N} x_i / N$$
$$\hat\sigma^2 = \sum_{i=1}^{N} (x_i - \bar x)^2 / N$$
b) Calculate the mean and variance of $\hat\mu$ and $\hat\sigma^2$ under repeated sampling. Show that the MLEs converge to the true values with $1/\sqrt{N}$ error.
Hint: recall from my second lecture that $N\hat\sigma^2/\sigma^2$ has a chi-square distribution with (N-1) degrees of freedom.
2. Maximum likelihood estimate for the parameters of a multinomial distribution. Consider N independent trials, each of which can result in one of T possible outcomes, with probabilities $p_1, p_2, \ldots, p_T$. The observed numbers for the T possible outcomes are $m_1, m_2, \ldots, m_T$. Calculate the MLEs for the probabilities.
Hint: write down the likelihood function
$$P(\vec m \mid \vec p) = const \prod_{i=1}^{T} p_i^{m_i}$$
and use a Lagrange multiplier to implement the constraint $\sum_{i=1}^{T} p_i = 1$.
3. Entropic segmentation. Consider Example C in my previous lecture and use the same observed data. Suppose you know that the change point x is between 12 and 16. Find the x that minimizes the entropy (maximizes the likelihood).
4. Bayesian inference for a Poisson process. Suppose that four samples (1,1,0,3) are drawn from a Poisson distribution specified by a mean λ. Assuming a uniform prior distribution for λ, calculate the posterior distribution of λ, the value $\hat\lambda$ that maximizes the posterior distribution, and $\bar\lambda$, the average over the posterior distribution.
UCSF cancer center
BP-203: Foundations of Mathematical Biology Statistics Lecture I: October 23, 2001, 2pm
Instructor: Ajay N. Jain, PhD Email: [email protected] Copyright © 2001, Ajay N. Jain All Rights Reserved
Introduction
Probability
♦ Probability distributions underlie everything we measure and predict
♦ Hao Li covered many aspects of probability theory: random variables, probability distributions (normal, Poisson, binomial…)
Statistics
♦ Statistics can be used to quantify the importance of measured effects
♦ I will cover basic statistical methods
♦ Good reference: Statistical Methods, Snedecor and Cochran (Eighth Edition)

Lecture I
What is a statistic? How is it related to a probability distribution?
Frequency distributions
Mean and standard deviation: population vs. sample
Example: Uniform distribution
Central Limit Theorem: The distribution of sample means and sample standard deviations
Confidence intervals
Hypothesis testing
Common parametric statistics

What is a statistic?
Statistics: techniques for collecting, analyzing, and drawing conclusions from data.
Probability theory is about populations
We only know about samples
A statistic is a computable quantity based on some sample of a population from which we can make inferences or conclusions.
Frequency distributions, histograms, and cumulative histograms
A frequency distribution is a compact method to capture the characteristics of variation for a collection of samples.
Graphically, it can be represented as
♦ Histogram with fixed bin sizes
♦ Cumulative histogram
It is different from a probability distribution, which is generally not known
[Figures: histogram from uniform distribution; cumulative histogram]
Mean and standard deviation for a discrete probability distribution versus a sample
Discrete probability distribution: mean and SD
$$\mu = \sum_{j=1}^{k} P_j X_j, \qquad \sigma = \sqrt{\sum_{j=1}^{k} P_j (X_j - \mu)^2}$$
Sample of size n: sample mean and sample SD
$$\bar X = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{\sum_{i=1}^{n} X_i}{n}, \qquad s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar X)^2}{n-1}}$$
From Hao Li's lectures: The normal distribution
The normal distribution is the most important in statistics
General normal density:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2 / (2\sigma^2)}$$
Standard normal cumulative distribution function:
$$\Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}\, du$$
Why?
♦ Many naturally occurring distributions are approximately normal. Any variable that results from the additive effects of many small effects will tend to be normally distributed.
♦ Often, for non-normal cases, a simple transformation yields a normal distribution (e.g. square root, log)
♦ The normal distribution has many convenient mathematical properties.
♦ Even if the distribution in the original population is far from normal, the distribution of sample means tends to become normal under random sampling.
Mean and standard deviation of sample mean
If we take repeated random samples of size n from any population (normal or not) with mean µ and standard deviation σ, the frequency distribution of the sample means has mean µ and standard deviation σ/sqrt(n)
Restated: the sample mean is an unbiased estimator of the population mean. Further, as n increases, the sample mean becomes a better estimator of the mean.
The Central Limit Theorem
Previous slide was about the mean and SD of sample means. The CLT is about the distribution of the sample means.
If $\bar X$ is the mean of a sample of size n from a population with mean µ and standard deviation σ, then as n approaches infinity:
$$P(L_1 < \bar X < L_2) = \int_{L_1}^{L_2} \frac{1}{\sigma\sqrt{2\pi/n}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2/n}}\, dx$$
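Example: Uniform discrete distribution (0…100)
$$\mu = \sum_{j=1}^{k} P_j X_j = \sum_{X=0}^{100} \frac{1}{101} X = \frac{1}{101} \cdot \frac{100(100+1)}{2} = 50$$
$$\sigma = \sqrt{\sum_{j=1}^{k} P_j (X_j - \mu)^2} = \sqrt{\sum_{X=0}^{100} \frac{1}{101} (X - 50)^2} = 29.155$$
For a sample of size n:
$$\bar X = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{\sum_{i=1}^{n} X_i}{n}, \qquad s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar X)^2}{n-1}}$$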
Consider the sample mean of this uniform distribution
The parent distribution is uniform, with mean 50 and standard deviation 29.155
What is the distribution of sample means from this parent distribution?
Let's pick n (= 1, 3, 100) observations, with replacement, from this distribution, and compute the sample mean many times (100,000)
♦ What will the mean of the sample means be?
♦ What will the standard deviation of the sample means be?
♦ What will the distribution of the sample means look like?
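The experiment can be reproduced with a short simulation. This sketch is not the code behind the slides: the function name is made up, the seed is arbitrary, and it uses 2,000 repetitions instead of 100,000 to keep the run short, so the numbers will differ slightly from those quoted.

```python
import random
import statistics

def simulate_sample_means(n, trials, seed=0):
    """Draw `trials` samples of size n from the integers 0..100 (uniform,
    with replacement) and return the list of sample means."""
    rng = random.Random(seed)
    return [sum(rng.randint(0, 100) for _ in range(n)) / n for _ in range(trials)]

means = simulate_sample_means(n=100, trials=2000)
mean_of_means = statistics.mean(means)      # should be near 50
sd_of_means = statistics.stdev(means)       # should be near 29.155 / sqrt(100)
```

Changing `n` to 1 or 3 and plotting a histogram of `means` shows the shape moving from uniform toward normal.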
N = 1: We see a uniform distribution
[Figure: histogram of the sample means for N = 1; horizontal axis 0 to 100]
Mean 50.148270 (pop mean: 50) SD 29.138903 (pop sd: 29.155)
N = 3: Pretty close to normal
[Figure: histogram of the sample means for N = 3; horizontal axis 0 to 100]
Mean 50.089103 (CLT 50) SD 16.785106 (16.8326)
N = 100: Essentially normal
[Figure: histogram of the sample means for N = 100; horizontal axis 30 to 70]
Mean 49.996305 (CLT 50) SD 2.916602 (2.916)
So now what?
We have a theorem that tells us something about the relationship between the sample mean and the population mean. We can begin to make statements about what a population mean is likely to be given that we have computed a sample mean.
Suppose we know the population standard deviation (this never happens)
We assume that the sample mean computed from n observations comes from a distribution with mean µ and standard deviation $\sigma_{pop}/\sqrt{n}$
So, 95% of the time:
$$\mu - 1.96\, \frac{\sigma_{pop}}{\sqrt{n}} < \bar X < \mu + 1.96\, \frac{\sigma_{pop}}{\sqrt{n}}$$
or
$$\bar X - 1.96\, \frac{\sigma_{pop}}{\sqrt{n}} < \mu < \bar X + 1.96\, \frac{\sigma_{pop}}{\sqrt{n}}$$
This is the 95% confidence interval for µ
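The known-σ confidence interval is simple enough to compute with the Python standard library. In this sketch the function name and the example numbers (a sample mean of 50 from n = 100 observations of the uniform population above) are hypothetical.

```python
import math
from statistics import NormalDist

def confidence_interval_known_sd(sample_mean, sigma_pop, n, level=0.95):
    """Two-sided confidence interval for the mean when sigma_pop is known."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)   # about 1.96 for level = 0.95
    half_width = z * sigma_pop / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width

# Hypothetical: sample mean 50.0 from n = 100 draws, population SD 29.155.
low, high = confidence_interval_known_sd(50.0, 29.155, 100)
```

When the population SD is unknown, the same structure applies with s in place of `sigma_pop` and a Student's t critical value in place of `z`.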
Suppose we don't know the population standard deviation (this always happens)
We will use the sample standard deviation s as an estimate for the population standard deviation. The procedure is very similar to the previous confidence interval, but the statistic follows Student's t distribution instead of the normal distribution.
Known population SD:
$$\bar X - 1.96\, \frac{\sigma_{pop}}{\sqrt{n}} < \mu < \bar X + 1.96\, \frac{\sigma_{pop}}{\sqrt{n}}$$
Unknown population SD:
$$\bar X - t_{0.05}\, \frac{s}{\sqrt{n}} < \mu < \bar X + t_{0.05}\, \frac{s}{\sqrt{n}}$$
This is the 95% confidence interval for µ. We look up t in a table that depends on n-1 and 0.05.
Confidence Interval Example
Statistical Hypothesis Testing
We wish to test whether we've seen a real effect
♦ H0 denotes the null hypothesis: no real effect
♦ H1 denotes the alternative: real effect
Statistical jargon
♦ Rejecting H0 when it is true is defined as a type I error
• Informally: false positive
• Significance level: probability of rejecting H0 when it is true
♦ Rejecting H1 when it is true (that is, failing to reject H0 when it is false) is a type II error
• Informally: false negative
• Power: probability of rejecting H0 when it is false
When we know the distributions (or can safely make assumptions) we can use tests like the t-test
When we cannot, we must use non-parametric tests (next lectures)
Testing a mean when SD not known
Process is very similar to confidence intervals. We want to test whether our mean is different from a particular value $\mu_0$. Compute t as follows:
$$t = \frac{\bar X - \mu_0}{s / \sqrt{n}}$$
For a particular level α and n-1 degrees of freedom, we look up t for 2 α
Test of the difference of two sample means: T-test with equal variances
Two samples of size n and m, with sample SDs U and V and sample means $\bar X$ and $\bar Y$:
$$t = \frac{\bar X - \bar Y}{S \sqrt{\frac{1}{n} + \frac{1}{m}}}, \qquad S = \sqrt{\frac{(n-1)U^2 + (m-1)V^2}{n+m-2}}$$
We use n+m-2 as the number of degrees of freedom in finding our critical value.
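The pooled two-sample statistic can be sketched with the standard library alone. The function name and the toy data are hypothetical; looking up the critical value for the returned degrees of freedom is left to a table or a statistics package.

```python
import math
import statistics

def pooled_t(xs, ys):
    """Two-sample t statistic assuming equal variances; returns (t, degrees of freedom)."""
    n, m = len(xs), len(ys)
    u2 = statistics.variance(xs)            # sample variance, n - 1 denominator
    v2 = statistics.variance(ys)
    s = math.sqrt(((n - 1) * u2 + (m - 1) * v2) / (n + m - 2))   # pooled SD
    t = (statistics.mean(xs) - statistics.mean(ys)) / (s * math.sqrt(1.0 / n + 1.0 / m))
    return t, n + m - 2

# Hypothetical toy samples.
t_stat, dof = pooled_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```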
Expression array example: Lymphoblastic versus myeloid leukemia
Lander data
♦ 6817 unique genes
♦ Acute Lymphoblastic Leukemia and Acute Myeloid Leukemia (ALL and AML) samples
♦ RNA quantified by Affymax oligo-technology
♦ 38 training cases (27 ALL, 11 AML)
♦ 34 testing cases (20/14)
We will consider whether any of the genes are differently expressed between the ALL and AML classes

Source: T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, E. S. Lander, "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring", Science, Vol. 286, 15 October 1999 (Reports).
Abstract: Although cancer classification has improved over the past 30 years, there has been no general approach for identifying new cancer classes (class discovery) or for assigning tumors to known classes (class prediction). Here, a generic approach to cancer classification based on gene expression monitoring by DNA microarrays is described and applied to human acute leukemias as a test case. A class discovery procedure automatically discovered the distinction between acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) without previous knowledge of these classes. An automatically derived class predictor was able to determine the class of new leukemia cases. The results demonstrate the feasibility of cancer classification based solely on gene expression monitoring and suggest a general strategy for discovering and predicting cancer classes for other types of cancer, independent of previous biological knowledge.

We have two classes: Use the T-statistic
We compute 6817 t-statistics (one for each gene)
What is the critical value?
♦ P = 0.05
♦ N = 27
♦ M = 11
♦ Degrees of freedom = 27+11-2 = 36
♦ Critical value (two-tailed test): 2.03
Of the 6817 genes, 1636 are "significant"
Less than 40% of these are significant on the test set!
What happened?
We made 6817 independent tests of a statistic at a significance level of 0.05
We should expect about 341 genes to show up even if we have no real effect
We can correct for this in many ways. One is to use a critical value for 0.05/6817 (due to Bonferroni). We will talk about other methods to avoid these problems in the next lecture
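The two numbers in this argument, the expected count of chance "hits" and the Bonferroni-corrected significance level, follow from one multiplication and one division. The function names below are illustrative.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of tests passing at level alpha when no effect is real."""
    return n_tests * alpha

def bonferroni_level(n_tests, alpha):
    """Per-test significance level that keeps the overall level near alpha."""
    return alpha / n_tests

chance_hits = expected_false_positives(6817, 0.05)   # about 341 genes
per_test_alpha = bonferroni_level(6817, 0.05)        # roughly 7.3e-06
```

The critical value of t then has to be looked up for `per_test_alpha` rather than for 0.05.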
Frequency distribution of sample variance
We discussed the frequency distribution of sample means
The Chi-square distribution is also important
If the $X_i$ are drawn from a normal distribution with variance $\sigma^2$, the following quantity follows a Chi-square distribution with n-1 degrees of freedom:
$$\frac{(n-1)\, s^2}{\sigma^2}$$
We can derive confidence intervals on sample variances as we did with sample means. More important, however, are the Chi-squared tests for goodness of fit and Chi-squared tests in contingency tables.
Chi-squared test of goodness of fit
We have some hypothesis about the true distribution from which a set of observations was drawn
We compute the following value, where $f_i$ are the observed and $F_i$ the expected frequencies in each of k classes:
$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - F_i)^2}{F_i}$$
We use (k-1) for the number of degrees of freedom. If we had to estimate values for the parent distribution, we reduce the number of degrees of freedom (e.g. (k-3) if we estimated the mean and SD from the data)
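The goodness-of-fit statistic is a one-line sum. The sketch below uses made-up counts for six equally likely classes (a hypothetical example, not data from the lecture).

```python
def chi_square_statistic(observed, expected):
    """Sum over classes of (f - F)^2 / F for observed f and expected F."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((f - F) ** 2 / F for f, F in zip(observed, expected))

# Hypothetical: 60 observations over 6 classes expected to be equally likely.
observed = [8, 12, 9, 11, 10, 10]
expected = [10.0] * 6
chi2 = chi_square_statistic(observed, expected)
dof = len(observed) - 1          # k - 1, since nothing was estimated from the data
```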
Example: Chi-squared test of goodness of fit
Contingency tables
Very often, we have data where each sample is classified by two different characteristics into disjoint subsets
Example: Set of patients in a study
♦ Treatment group versus control group
♦ Responders versus non-responders
We can use RxC contingency tables to decide whether there is any significant difference among the groups in terms of deviations from expected frequencies.
Chi-square example: RxC table
$$\chi^2 = \sum \frac{(f - F)^2}{F}$$
Expected frequency for one cell: F = (1796/8766)*(4258/8766)*8766 = 872.4
DF = (R-1)(C-1) = 4
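Expected frequencies for every cell come from the row and column totals, exactly as in the F = 872.4 calculation. The sketch below uses a small made-up 2x2 table, since the full table from the slide is not reproduced in the text; the function names are illustrative.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

def chi_square_table(table):
    """Return (chi-square statistic, degrees of freedom) for an RxC table."""
    exp = expected_counts(table)
    chi2 = sum((f - F) ** 2 / F
               for row, exp_row in zip(table, exp)
               for f, F in zip(row, exp_row))
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof

# Hypothetical 2x2 table (e.g. treatment/control by responder/non-responder).
chi2, dof = chi_square_table([[10, 20], [20, 10]])
```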
What about paired data?
Thus far, we have considered the comparison of unpaired data. The most common parametric method for considering paired data is Pearson's correlation, r
$$r = \frac{\sum_{i=1}^{n} (X_i - \bar X)(Y_i - \bar Y)}{\sqrt{\left(\sum_{i=1}^{n} (X_i - \bar X)^2\right) \left(\sum_{i=1}^{n} (Y_i - \bar Y)^2\right)}}$$
r ranges from -1 to 1. It is exactly 1 if X and Y are linearly related with positive slope. It is exactly -1 if X and Y are linearly related with negative slope. It is extremely sensitive to outliers. We will discuss non-parametric methods to deal with paired data in the next lecture. Mark Segal will talk about regression.
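Pearson's r can be computed directly from its definition. The function name and data below are illustrative; the last example shows how a single outlier can change r substantially.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r_up = pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])        # exact positive linear relation
r_down = pearson_r([1.0, 2.0, 3.0], [6.0, 4.0, 2.0])      # exact negative linear relation
# Hypothetical outlier: one extreme point appended to an otherwise perfect line.
r_outlier = pearson_r([1.0, 2.0, 3.0, 4.0, 50.0], [2.0, 4.0, 6.0, 8.0, 0.0])
```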
Conclusions
♦ The most important thing to understand is the relationship between statistics and probability distributions
♦ Many statistics have been developed that have known distributions, and of which we can make use for tests of hypotheses
♦ If you need to make use of serious parametric statistics, you should learn a package like R or S-plus
♦ Statistics books are notoriously opaque in terms of just saying what the formula is!
♦ If you are actually concerned with knowing whether you have a real result, use non-parametric tests with resampling methods to decide significance (next lectures).
BP-203: Foundations of Mathematical Biology Statistics Lecture II: October 23, 2001, 2pm Instructors: Ajay N. Jain, PhD Jane Fridlyand, PhD
Copyright © 2001 All Rights Reserved
Non-parametric statistics
Parametric statistics ♦ Require an assumption about the underlying distributions of data ♦ With those assumptions, they often provide sensitive tests of significance ♦ However, the assumptions are often not reasonable, and it can be difficult to establish reasonableness
Non-parametric statistics ♦ Make no assumption about distributional characteristics ♦ Lacking those assumptions, they may not be as sensitive as appropriately applied parametric tests ♦ However, one avoids the issue of whether a set of distributional assumptions is correct ♦ Note: if your data are so close to the edge of nominal significance that you need to play games with different statistical tests, you have bigger problems to worry about.
Resampling and permutation-based methods
Non-parametric statistics reduce reliance on distributional assumptions about your data. However, the distributions of the statistics themselves are often derived based on approximations that involve other assumptions. Resampling and permutation-based methods move toward deriving everything from the data observed.
Lecture II
Non-parametric statistical tests ♦ Unpaired data • Rank sum test • Comparisons of distributions · Kolmogorov-Smirnov test · Receiver-operator characteristic curves: measure separation of distributions
♦ Paired data • Signed rank test • Spearman’s rank correlation • Kendall’s Tau
Resampling methods: Jane Fridlyand, PhD ♦ The bootstrap (Efron), Jackknife ♦ Using resampling methods for confidence intervals and hypothesis testing
Lecture III (Tuesday)
♦ General procedure for applying permutation-based methods to derive significance
♦ Application of resampling and permutation methods to array-based data
♦ General strategy for designing experiments involving large numbers of measurements
♦ Homework will be assigned (it will not require programming)
A test of location: The rank-sum test
The rank sum test is used to test the null hypothesis that two population distribution functions corresponding to two random samples are identical against the alternative hypothesis that they differ by location. Also called the Wilcoxon or Mann-Whitney rank sum test (Wilcoxon invented it).
A test of location: The rank-sum test
Two samples with n1 and n2 observations
♦ Order all observations from low to high
♦ Assign to each observation its rank
♦ For ties, assign each observation the average rank (e.g. if 3 tied observations occupy ranks 9, 10, 11, we assign each 10)
♦ Sum the ranks for set 1
♦ Sum the ranks for set 2
♦ If n1 = n2, take the smaller sum: this is T
♦ If not, let n1 be the size of the smaller sample and compute:
T1 = sum of the ranks of the smaller sample
T2 = n1 (n1 + n2 + 1) − T1
T = min(T1, T2)
Example: Rank sum test
So, T = 60. For small n, we can look up the numbers in a table.
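Rank sum significance
For larger n, we must use an approximation based on the normal distribution:
$$Z = \frac{|\mu - T| - 0.5}{\sigma}, \qquad \mu = \frac{n_1 (n_1 + n_2 + 1)}{2}, \qquad \sigma = \sqrt{\frac{n_2\,\mu}{6}}$$
Here n1 is the size of the smaller sample. If Z > 1.96 we have significance at p = 0.05.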
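The ranking procedure and the normal approximation can be sketched as follows. This is an illustrative C sketch, not the lecture's code: the names `rank_sum_T` and `rank_sum_exceeds` are hypothetical, ranks are found by counting rather than sorting, and the comparison Z > z_crit is done on squared quantities (using σ² = n2·µ/6) so that no square root is needed.

```c
#include <assert.h>
#include <stddef.h>

/* Rank of x within the pooled samples; tied values share the average
   of the ranks they occupy. x must be one of the pooled values. */
static double pooled_rank(double x, const double *a, size_t na,
                          const double *b, size_t nb)
{
    size_t i, less = 0, equal = 0;
    for (i = 0; i < na; ++i) {
        if (a[i] < x) ++less; else if (a[i] == x) ++equal;
    }
    for (i = 0; i < nb; ++i) {
        if (b[i] < x) ++less; else if (b[i] == x) ++equal;
    }
    return (double)less + ((double)equal + 1.0) / 2.0;
}

/* T = min(T1, T2), where T1 is the rank sum of the smaller sample and
   T2 = n1 (n1 + n2 + 1) - T1 with n1 the smaller sample size. */
double rank_sum_T(const double *a, size_t na, const double *b, size_t nb)
{
    const double *small = a, *large = b;
    size_t ns = na, nl = nb, i;
    double T1 = 0.0, T2;

    assert(na > 0 && nb > 0);
    if (nb < na) { small = b; large = a; ns = nb; nl = na; }
    for (i = 0; i < ns; ++i)
        T1 += pooled_rank(small[i], small, ns, large, nl);
    T2 = (double)ns * (double)(ns + nl + 1) - T1;
    return T1 < T2 ? T1 : T2;
}

/* Returns 1 if Z = (|mu - T| - 0.5) / sigma exceeds z_crit (z_crit >= 0),
   else 0. n1 is the smaller sample size. */
int rank_sum_exceeds(double T, size_t n1, size_t n2, double z_crit)
{
    double mu = (double)n1 * (double)(n1 + n2 + 1) / 2.0;
    double var = (double)n2 * mu / 6.0;              /* sigma squared */
    double num = (mu > T ? mu - T : T - mu) - 0.5;   /* |mu - T| - 0.5 */

    assert(z_crit >= 0.0);
    if (num <= 0.0)
        return 0;
    return num * num > z_crit * z_crit * var;
}
```

For small samples the table lookup described above remains the appropriate route; the approximation is only meant for larger n.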
How good is the test?
In large normal samples, the t test is slightly better at finding significant differences. In small non-normal samples, the rank sum test is rarely much worse than the t test and is often much better.
Comparing distributions
Suppose we want to know if there is any difference between the distributions of two sets of observations. We don’t care if the difference is location or dispersion.
The Kolmogorov-Smirnov test ♦ Informally: related to the maximum difference between the cumulative histograms of the two sample sets (of sizes m and n):
$$J = \frac{mn}{\gcd(m, n)} \max \left| \mathrm{chist}(pop_1) - \mathrm{chist}(pop_2) \right|$$
Again, look up whether J is big enough to reject the null hypothesis that the distributions are the same.
Informal example: Relationship of genomic copy number to gene expression
Example: Kolmogorov-Smirnov test
We are looking at the ability of people to generate saliva on demand, plus and minus feedback to tell them if they are successful.
We sort all of our samples. We compute the cumulative histogram using the values from each set as the thresholds (since these are the only points where a change will happen). We find the max difference.
Our max chist difference is 6/10. Our multiplier mn/gcd(m,n) is 10*10/10 = 10. So J = 6. From a table, we get p = 0.0524.
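The cumulative-histogram comparison can be sketched in C as below. This is a hedged illustration, not the lecture's implementation: the name `ks_J` is hypothetical, and the statistic is computed in integer arithmetic by noting that, with cx and cy the counts at or below a threshold, the scaled difference equals |cx·n − cy·m| / gcd(m, n).

```c
#include <assert.h>
#include <stddef.h>

static size_t gcd_size(size_t a, size_t b)
{
    while (b != 0) { size_t t = a % b; a = b; b = t; }
    return a;
}

/* Number of values in v that are <= t (the unnormalized cumulative histogram). */
static size_t count_le(const double *v, size_t n, double t)
{
    size_t i, c = 0;
    for (i = 0; i < n; ++i)
        if (v[i] <= t) ++c;
    return c;
}

/* J = (mn / gcd(m, n)) * max |chist(x) - chist(y)|, evaluated at every
   observed value, since those are the only points where either
   cumulative histogram changes. */
size_t ks_J(const double *x, size_t m, const double *y, size_t n)
{
    size_t g, best = 0, k;

    assert(m > 0 && n > 0);
    g = gcd_size(m, n);
    for (k = 0; k < m + n; ++k) {
        double t = (k < m) ? x[k] : y[k - m];
        size_t a = count_le(x, m, t) * n;   /* chist(x) scaled by m*n */
        size_t b = count_le(y, n, t) * m;   /* chist(y) scaled by m*n */
        size_t d = (a > b) ? a - b : b - a;
        if (d > best) best = d;
    }
    return best / g;
}
```

With two samples of size 10 whose cumulative histograms differ by at most 6/10, this returns J = 6, matching the worked example. The significance still has to come from a table of J.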
Molecular similarity: Quantitative comparison of 2D versus 3D
Nicotine example
♦ Nicotine
♦ Abbott molecule: competitive agonist
♦ Natural ligand (acetylcholine)
♦ Pyridine derivatives
2D similarity
♦ Graph-based approach to comparing organic structures
♦ Very efficient algorithm
♦ Can search 100,000 compounds in seconds
Ranked list versus nicotine places competitive ligands last
[Figure: chemical structures ranked by 2D similarity to nicotine, with scores 1.00, 0.99, 0.90, 0.89, 0.82, 0.73, 0.65, 0.58, 0.57, 0.54, 0.45, 0.13]
Molecular similarity: 2D versus 3D
Nicotine example
♦ Nicotine
♦ Abbott molecule: competitive agonist
♦ Natural ligand (acetylcholine)
♦ Pyridine derivatives
3D similarity
♦ Surface-based comparison approach
♦ Requires dealing with molecular flexibility and alignment
♦ Much slower, but fast enough for practical use
Ranked list places the Abbott ligand near the top, and acetylcholine has a “high” score
[Figure: chemical structures ranked by 3D similarity to nicotine, with scores 1.00, 0.97, 0.93, 0.91, 0.90, 0.89, 0.88, 0.87, 0.87, 0.83, 0.82, 0.63]
Morphological similarity: Measure the molecules from the outside
Similarity between molecules is defined as a function of the differences in surface measurements from observation points.
[Figure: molecular structures]
Data
♦ Data from: G. Jones, P. Willett, R. C. Glen, A. R. Leach, & R. Taylor, J. Mol. Biol. 267 (1997) 727-748
♦ 134 protein/ligand complexes (> 20 different proteins with multiple ligands)
♦ 74 related pairs of molecules (small sample from space of all possible related pairs of molecules)
♦ 680 unrelated pairs (randomly selected from the set above, avoiding pairs known to bind competitively)
See: A. N. Jain. Morphological Similarity... J. Comp.-Aided Mol. Design. 14: 199-213, 2000.
For each technique, we compute an estimate of two distributions
♦ Distribution of random variable X (similarity function of ω, the pair of molecules) for ω in the space of related pairs
♦ Distribution of random variable X (similarity function of ω, the pair of molecules) for ω in the space of unrelated pairs
♦ Compare the estimated density functions and the cumulative distribution functions
Molecular similarity: 2D
2D similarity
♦ Graph-based approach to comparing organic structures
♦ Very efficient algorithm
♦ Can search 100,000 compounds in seconds
[Figure: chemical structures ranked by 2D similarity to nicotine, with scores 1.00, 0.99, 0.90, 0.89, 0.82, 0.73, 0.65, 0.58, 0.57, 0.54, 0.45, 0.13]
What is the algorithm?
♦ We compute all atomic paths of length K in a molecule of size N atoms
♦ We mark a bit in a long bitstring if the corresponding path exists
♦ We fold the bitstring in half many times, performing an OR, thus yielding a short bitstring
♦ Given bitstrings A and B, we compute S(A,B): the number of bits in common divided by the total number of bits in either
Complexity: Computing the bitstring is O(N); computing S(A,B) is essentially constant time (small constant!)
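The folding and comparison steps can be sketched in C. This is a minimal illustration under stated assumptions, not the Daylight implementation: the fingerprint is assumed to be an array of 64-bit words, the names `fold_half` and `tanimoto` are hypothetical, and path enumeration (setting the bits) is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold the bitstring in half once: OR the upper half into the lower half.
   Returns the new length in words. Requires an even word count. */
size_t fold_half(uint64_t *bits, size_t nwords)
{
    size_t half = nwords / 2, i;

    assert(nwords >= 2 && nwords % 2 == 0);
    for (i = 0; i < half; ++i)
        bits[i] |= bits[i + half];
    return half;
}

/* Count the set bits in one word. */
static unsigned popcount64(uint64_t w)
{
    unsigned c = 0;
    while (w) { w &= w - 1; ++c; }
    return c;
}

/* S(A,B) = bits set in both / bits set in either.
   Assumption: two all-zero bitstrings score 0. */
double tanimoto(const uint64_t *a, const uint64_t *b, size_t nwords)
{
    unsigned long common = 0, either = 0;
    size_t i;

    for (i = 0; i < nwords; ++i) {
        common += popcount64(a[i] & b[i]);
        either += popcount64(a[i] | b[i]);
    }
    if (either == 0)
        return 0.0;
    return (double)common / (double)either;
}
```

Because the folded bitstring has a fixed, short length, the comparison cost does not depend on molecule size, which is consistent with the "essentially constant time" remark above.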
Molecular similarity: 3D
3D similarity
♦ Surface-based comparison approach
♦ Requires dealing with molecular flexibility and alignment
♦ Much slower, but fast enough for practical use
[Figure: chemical structures ranked by 3D similarity to nicotine, with scores 1.00, 0.97, 0.93, 0.91, 0.90, 0.89, 0.88, 0.87, 0.87, 0.83, 0.82, 0.63]
What is the algorithm?
♦ Take a sampling of the conformations of molecules A and B
♦ For each conformation, optimize the conformation and alignment of the other molecule to maximize S
♦ Report the average S for all optimizations
Key issues: not number of atoms. Number of rotatable bonds, alignment
Distributions for the two methods are very different: What are the quantitative overlaps?
[Figure: probability distributions and their integrals for Daylight Tanimoto similarity (topological similarity) and for morphological similarity, each shown for molecule pairs observed crystallographically to bind the same sites and for molecule pairs observed crystallographically to bind different sites]
The unrelated pairs distributions are nearly normal. The related pairs distributions are multi-modal, possibly a mixture of normals.
Receiver operator characteristic curves (ROC curves) plot the relationship of TP rate and FP rate
To construct a ROC curve:
♦ Vary the similarity threshold over all observed values
♦ At each threshold, compute the proportion of true positives and the proportion of false positives
♦ At low thresholds, we should have high FP, but perfect TP
♦ At high thresholds, we should have low FP, but poorer TP
[Figure: ROC curves for Morphological Similarity (MS) and Daylight Tanimoto Similarity; x-axis: False positive rate; y-axis: True positive rate]
At a false positive rate of 0.05, MS yields a 47% reduction in the number of related pairs that are lost. At a true positive rate of 0.70, MS yields a 7-fold better elimination of false positives.
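One point on a ROC curve can be computed as sketched below. This C sketch is illustrative only: the type `roc_point` and function `roc_at_threshold` are hypothetical names, and a pair is assumed to be called "related" when its similarity score is at or above the threshold.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double tp_rate;   /* fraction of related pairs scoring >= threshold */
    double fp_rate;   /* fraction of unrelated pairs scoring >= threshold */
} roc_point;

/* Compute the true positive and false positive rates at one threshold. */
roc_point roc_at_threshold(const double *related, size_t n_related,
                           const double *unrelated, size_t n_unrelated,
                           double threshold)
{
    roc_point p;
    size_t i, tp = 0, fp = 0;

    assert(n_related > 0 && n_unrelated > 0);
    for (i = 0; i < n_related; ++i)
        if (related[i] >= threshold) ++tp;
    for (i = 0; i < n_unrelated; ++i)
        if (unrelated[i] >= threshold) ++fp;
    p.tp_rate = (double)tp / (double)n_related;
    p.fp_rate = (double)fp / (double)n_unrelated;
    return p;
}
```

Sweeping the threshold over every observed score and plotting `fp_rate` against `tp_rate` traces out the full curve.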
Paired data
Spearman’s rank correlation test
♦ We have (Xi,Yi) for n samples
♦ We want to know if there is a relationship between the paired samples, but we don’t know if it should be linear, so we need an alternative to Pearson’s r
♦ Replace the (Xi,Yi) with (Rank(Xi),Rank(Yi))
♦ Compute Pearson’s r for the new values
♦ Alternative formulation (exact when there are no ties), where d = difference in ranks for each data pair:
$$r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$$
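The rank-difference formulation, in its standard tie-free form r_s = 1 − 6Σd²/(n(n² − 1)), can be sketched in C. The function name `spearman_from_ranks` is hypothetical, and the sketch assumes the ranks 1..n have already been assigned with no ties.

```c
#include <assert.h>
#include <stddef.h>

/* Spearman's r_s from precomputed ranks, using
   r_s = 1 - 6 * sum(d^2) / (n (n^2 - 1)),
   where d is the rank difference for each data pair.
   Assumes tie-free ranks 1..n in both arrays. */
double spearman_from_ranks(const double *rank_x, const double *rank_y, size_t n)
{
    double sum_d2 = 0.0, dn = (double)n;
    size_t i;

    assert(n >= 2);
    for (i = 0; i < n; ++i) {
        double d = rank_x[i] - rank_y[i];
        sum_d2 += d * d;
    }
    return 1.0 - 6.0 * sum_d2 / (dn * (dn * dn - 1.0));
}
```

When ties are present, computing Pearson’s r directly on the average ranks is the safer route, as the bullet list above describes.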
Example: Spearman’s rank correlation
Paired data
Kendall’s Tau: another rank correlation test ♦ We have (Xi,Yi) for n samples ♦ Definition • Look at all pairs (Xi,Xj) and corresponding (Yi,Yj) • Score a 1 for a concordant event • Score a -1 for a discordant event • Score 0 for ties in values • Normalize result based on the number of comparisons • We get a statistic from -1 to 1
♦ Kendall’s Tau has a slightly nicer frequency distribution ♦ It can be less sensitive to single outliers
Codelet to compute Kendall’s Tau (generalized for real-valued ties)
#include <math.h>   /* fabs */

/* Kendall's Tau between actual[] and predicted[], each of length n.
   Two values closer than delta1 (actual) or delta2 (predicted) are
   treated as tied and score 0. */
double k_tau(double *actual, double *predicted, int n, double delta1, double delta2)
{
    long int i, j;
    double total = 0.0, compare = 0.0;

    for (i = 0; i < n; ++i) {
        for (j = i + 1; j < n; ++j) {
            compare += 1.0;   /* every pair counts in the normalization, ties included */
            /* first check if either is equal --> get no benefit */
            if (fabs(actual[i] - actual[j]) <= delta1) { continue; }
            if (fabs(predicted[i] - predicted[j]) <= delta2) { continue; }
            /* now check if they are correct or incorrect */
            if ((actual[i] > actual[j]) && (predicted[i] > predicted[j]))
                total += 1.0;
            else if ((actual[i] < actual[j]) && (predicted[i] < predicted[j]))
                total += 1.0;
            else
                total += -1.0;   /* we have a missed rank match */
        }
    }
    if (compare == 0.0) return(0.0);
    return(total / compare);
}
Paired data
Signed rank test (Wilcoxon) ♦ We have (Xi,Yi) for n samples ♦ Definition • Compute all differences (Xi-Yi) • Sort them, low to high, based on absolute value • Assign ranks to each • Multiply each rank associated with a negative difference by -1 • Sum the negative ranks and positive ranks • Take the smaller magnitude sum: This is your statistic
♦ Again, tables are available for small n ♦ An approximation is available for large n
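The signed rank definition can be sketched in C as follows. This is an illustration rather than the lecture's code: the name `signed_rank_W` is hypothetical, ranks are found by counting instead of sorting, tied absolute differences share their average rank, and zero differences are dropped (a common convention that the slide does not state).

```c
#include <assert.h>
#include <stddef.h>

static double abs_double(double v) { return v < 0.0 ? -v : v; }

/* Wilcoxon signed rank statistic: the smaller of the positive and
   negative rank sums (by magnitude).
   Assumption: pairs with a zero difference are ignored. */
double signed_rank_W(const double *x, const double *y, size_t n)
{
    double pos = 0.0, neg = 0.0;
    size_t i, j;

    assert(n > 0);
    for (i = 0; i < n; ++i) {
        double di = x[i] - y[i];
        double ai = abs_double(di), rank;
        size_t less = 0, equal = 0;

        if (di == 0.0)
            continue;
        /* rank of |di| among all nonzero absolute differences */
        for (j = 0; j < n; ++j) {
            double dj = x[j] - y[j];
            double aj = abs_double(dj);
            if (dj == 0.0) continue;
            if (aj < ai) ++less; else if (aj == ai) ++equal;
        }
        rank = (double)less + ((double)equal + 1.0) / 2.0;
        if (di > 0.0) pos += rank; else neg += rank;
    }
    return pos < neg ? pos : neg;
}
```

For differences 2, −1, 4 the ranks are 2, 1, 3, so the negative sum is 1 and the positive sum is 5, giving a statistic of 1.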
Conclusions: Non-parametric statistics
Non-parametric statistics reduce reliance on distributional assumptions about your data ♦ They often give very sensitive tests ♦ Generally though, the corresponding parametric tests are more sensitive when their assumptions hold ♦ Note that the process is generally the same • Compute your statistic • Look up a significance value or compute one from an approximation
Resampling and permutation-based methods move toward deriving everything from the data observed
BP-203: Foundations of Mathematical Biology Statistics Lecture III: October 30, 2001, 2pm Instructor: Ajay N. Jain, PhD
Copyright © 2001 All Rights Reserved
Lecture III: Resampling and permutation-based methods
Resampling methods ♦ Efron’s bootstrap and related methods ♦ Resample with replacement from a population distribution constructed from the empirical discrete observed frequency distribution
Permutation-based methods ♦ Shatter the relationship on which your statistic is computed ♦ Empirical method to derive the null distribution of a particular statistic given the precise context in which it is applied
We will focus on hypothesis testing and will address the problem of multiple comparisons
Two basic cases
Unpaired data, multiple classes
♦ We have computed some statistic based on the class assignments
♦ The idea is to empirically generate the null distribution of this statistic by repeatedly randomly permuting the class membership
Paired data, or, more generally, vectorial data of any dimension
♦ Each sample has a vector
♦ We compute a statistic that depends on the relationship of variables within the vectors over all samples
♦ The idea is to empirically generate the null distribution of this statistic by repeatedly randomly permuting parts of the vectors across the samples
Unpaired data: Rewrite our statistical function to take values+classes
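Original form: samples $X_1 \ldots X_{n_1}$, $Y_1 \ldots Y_{n_2}$, $Z_1 \ldots Z_{n_3}$ with statistic $f(\vec{X}, \vec{Y}, \vec{Z})$
Rewritten form: values $V_1 \ldots V_{n_1+n_2+n_3} = \vec{X}, \vec{Y}, \vec{Z}$ and classes $C_1 \ldots C_{n_1+n_2+n_3}$ = class, with statistic $f'(\vec{V}, \vec{C})$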
We need the distribution of f’ under either of the following ♦ Random permutations of the order of the vector V ♦ Random resamplings of V (with replacement) from V itself (this is equivalent to the bootstrap procedure that Jane described)
Note: We have essentially converted this to a paired data set
What is the intuition?
Each permutation or resampling is a simulated experiment where we know the null hypothesis is true.
By generating the empirical distribution of f’ under many random iterations, we get an accurate picture of the likelihood of observing a statistic of any magnitude given the exact distributional and size characteristics of our samples.
To assess significance of a statistic of value Z
♦ Perform N permutations as described
♦ Compute f’ for each
♦ Count the number that meet or exceed Z (= nbetter)
♦ Significance = nbetter/N (the probability that we will observe a statistic as good as Z under the null hypothesis)
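The counting procedure can be sketched in C. This is a minimal sketch under stated assumptions, not the lecture's code: the statistic f’ is taken to be the absolute difference of class means for two classes labeled 0 and 1, the names `abs_mean_diff` and `permutation_p` are hypothetical, and the standard library `rand()` is used for shuffling, which is adequate for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Example statistic f'(V, C): |mean of class 1 - mean of class 0|. */
static double abs_mean_diff(const double *v, const int *cls, size_t n)
{
    double s0 = 0.0, s1 = 0.0, d;
    size_t n0 = 0, n1 = 0, i;

    for (i = 0; i < n; ++i) {
        if (cls[i]) { s1 += v[i]; ++n1; } else { s0 += v[i]; ++n0; }
    }
    assert(n0 > 0 && n1 > 0);
    d = s1 / (double)n1 - s0 / (double)n0;
    return d < 0.0 ? -d : d;
}

/* Fisher-Yates shuffle of the values; the class vector stays fixed,
   so the value-to-class relationship is broken. */
static void shuffle_values(double *v, size_t n)
{
    size_t i;
    for (i = n - 1; i > 0; --i) {
        size_t j = (size_t)rand() % (i + 1);
        double t = v[i]; v[i] = v[j]; v[j] = t;
    }
}

/* Significance = nbetter / N: the fraction of permutations whose
   statistic meets or exceeds the observed one. */
double permutation_p(const double *v, const int *cls, size_t n, size_t n_perm)
{
    double observed, *work;
    size_t i, k, nbetter = 0;

    assert(n >= 2 && n_perm > 0);
    observed = abs_mean_diff(v, cls, n);
    work = malloc(n * sizeof *work);
    assert(work != NULL);
    for (i = 0; i < n; ++i)
        work[i] = v[i];
    for (k = 0; k < n_perm; ++k) {
        shuffle_values(work, n);
        if (abs_mean_diff(work, cls, n) >= observed)
            ++nbetter;
    }
    free(work);
    return (double)nbetter / (double)n_perm;
}
```

Any other computable statistic can be substituted for `abs_mean_diff` without changing the surrounding loop, which is the main appeal of the approach.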
Example: Alternative to T test
We sample from the standard normal distribution to get X1…X10 and Y1…Y10.
Using the T test with equal variances, we compute the following from our sample:
$$t = \frac{\bar{X} - \bar{Y}}{S \sqrt{\frac{1}{n} + \frac{1}{m}}}, \qquad S = \sqrt{\frac{(n - 1)U + (m - 1)V}{n + m - 2}}$$
where U and V are the sample variances of X and Y.
The critical value for this test given 18 DF is 2.101.
If we do this 100,000 times to check how accurate the critical value is, we get a proportion of 0.0509 t scores that exceed this critical value.
So, statistical theory works fine. How does permutation do?
The permutation approach yields similar results
Permutation simulation
♦ We sample from the standard normal distribution to get X1…X10 and Y1…Y10
♦ We compute the T test with equal variances
♦ For each sample, we perform 1000 permutations, recompute t, and count the number better than our initial t.
What proportion of instances yield p values better than 0.05?
♦ With 10,000 simulations, 0.0483
♦ If we resample with replacement, we get 0.0492
As with the critical value from statistical theory, we get an appropriate proportion by direct simulation.
We don’t have to use a “real statistic”
♦ We sample from the standard normal distribution to get X1…X10 and Y1…Y10
♦ We compute the absolute difference of means
♦ For each sample, we perform 1000 permutations, recompute, and count the number better than our initial difference.
What proportion of instances yield p values better than 0.05?
♦ With 10,000 simulations, 0.0483
This is exactly the same as before. We did not have to normalize by the pooled variance.
Suppose our statistical assumptions fail?
♦ We sample from the standard normal distribution to get X1…X10
♦ We sample from a normal distribution with mean 0 and variance 16 to get Y1…Y10
♦ So, we don’t have equal variances
What proportion of instances yield p values better than 0.05?
♦ T-test: 0.061
♦ Permutation: 0.053
When our statistical assumptions fail, permutation-based methods give us a better estimate of significance.
Paired data: Permute on the data pairing
$X_1 \ldots X_n,\; Y_1 \ldots Y_n$ with statistic $f(\vec{X}, \vec{Y})$ → $X_1 \ldots X_n,\; Y_1 \ldots Y_n$ with statistic $f'(\vec{X}, \vec{Y})$
We need the distribution of f’ under either of the following ♦ Random permutations of the order of the vector Y ♦ Random resamplings of X and Y (with replacement) from X and Y themselves (like the bootstrap procedure)
Note that each of the Xi or Yi can be vectors themselves.
Expression array example: Lymphoblastic versus myeloid leukemia
Lander data
♦ 6817 unique genes
♦ Acute Lymphoblastic Leukemia and Acute Myeloid Leukemia (ALL and AML) samples
♦ RNA quantified by Affymetrix oligo-technology
♦ 38 training cases (27 ALL, 11 AML)
♦ 34 testing cases (20/14)
We will consider whether any of the genes are differently expressed between the ALL and AML classes
Source: T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, E. S. Lander. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science, Vol. 286, 15 October 1999.
Abstract: Although cancer classification has improved over the past 30 years, there has been no general approach for identifying new cancer classes (class discovery) or for assigning tumors to known classes (class prediction). Here, a generic approach to cancer classification based on gene expression monitoring by DNA microarrays is described and applied to human acute leukemias as a test case. A class discovery procedure automatically discovered the distinction between acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) without previous knowledge of these classes. An automatically derived class predictor was able to determine the class of new leukemia cases. The results demonstrate the feasibility of cancer classification based solely on gene expression monitoring and suggest a general strategy for discovering and predicting cancer classes for other types of cancer, independent of previous biological knowledge.
What happened when we applied the t test naively?
We compute 6817 t-statistics (one for each gene)
What is the critical value?
♦ P = 0.05
♦ N = 27
♦ M = 11
♦ Degrees of freedom = 27+11-2 = 36
♦ Critical value (two-tailed test): 2.03
Of the 6817 genes, 1636 are “significant”
Less than 40% of these are significant on the test set!
What happened?
♦ We made 6817 independent tests of a statistic at a significance level of 0.05
♦ We should expect about 341 genes to show up even if we have no real effect, assuming that our statistical assumptions are OK
How can we use permutation to do a better job?
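As a quick check of the numbers on this slide (a worked calculation added for illustration, assuming every one of the 6817 tests is under the null hypothesis):

```latex
E[\text{false positives}] = 6817 \times 0.05 = 340.85 \approx 341,
\qquad
\frac{341}{1636} \approx 0.21
```

Under that assumption, roughly a fifth of the 1636 nominally significant genes would be expected to appear by chance alone.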
Permutation analysis in array data: Conservative approach is to take the max statistic
We are defining our new statistic to be one computed over the vector of all genes coupled to the class information. We define our statistic to be the maximum of a particular statistic, computed for each gene.
We will use two statistics ♦ Kendall’s Tau, measuring the rank correlation of gene expression levels against the AML/ALL classes represented as 0 and 1 ♦ The t statistic, functionally implemented on paired data of gene expression levels and classes represented as 0 and 1 ♦ For each case, we define our new statistic as the max(over all genes)
Permutation analysis in array data: Conservative approach is to take the max statistic
Sample   Gene 1  Gene 2  Gene 3  Gene 4  Gene 5  Gene 6  Gene 7  Gene 8  Gene 9  Class
1        0.99    0.98    0.98    0.97    0.97    0.95    0.95    0.95    0.96    1
2        1.15    1.11    1.07    1.04    1.01    0.99    0.98    0.96    0.96    1
3        1.11    1.14    1.22    1.30    1.37    1.39    1.39    1.39    1.37    1
4        1.00    1.01    1.01    0.99    0.96    0.93    0.91    0.89    0.88    1
5        1.04    1.01    0.97    0.94    0.93    0.92    0.90    0.90    0.91    1
6        1.17    1.25    1.32    1.38    1.43    1.46    1.50    1.53    1.55    0
7        1.12    1.16    1.20    1.26    1.34    1.42    1.49    1.54    1.53    0
8        0.96    0.97    0.97    0.97    0.96    0.96    0.97    0.98    0.98    0
9        1.03    1.04    1.05    1.06    1.07    1.09    1.10    1.12    1.17    0
10       1.16    1.19    1.21    1.23    1.25    1.25    1.26    1.27    1.28    0
Statistic for each gene:
         0.16    0.24    0.18    0.27    0.27    0.27    0.38    0.38    0.42
Maximum magnitude statistic: 0.42
Permutation 1: Bogus correlation
Sample   Gene 1  Gene 2  Gene 3  Gene 4  Gene 5  Gene 6  Gene 7  Gene 8  Gene 9  Class
1        0.99    0.98    0.98    0.97    0.97    0.95    0.95    0.95    0.96    1
2        1.15    1.11    1.07    1.04    1.01    0.99    0.98    0.96    0.96    1
3        1.11    1.14    1.22    1.30    1.37    1.39    1.39    1.39    1.37    1
4        1.00    1.01    1.01    0.99    0.96    0.93    0.91    0.89    0.88    1
5        1.04    1.01    0.97    0.94    0.93    0.92    0.90    0.90    0.91    1
6        1.17    1.25    1.32    1.38    1.43    1.46    1.50    1.53    1.55    0
7        1.12    1.16    1.20    1.26    1.34    1.42    1.49    1.54    1.53    0
8        0.96    0.97    0.97    0.97    0.96    0.96    0.97    0.98    0.98    0
9        1.03    1.04    1.05    1.06    1.07    1.09    1.10    1.12    1.17    0
10       1.16    1.19    1.21    1.23    1.25    1.25    1.26    1.27    1.28    0
Statistic for each gene:
         0.15    0.09    0.09    0.04    0.02    0.02    0.02    0.07    0.04
Maximum magnitude statistic: 0.15
Repeated permutation yields a cumulative distribution
Permutation Based Estimation of Significance
Unadjusted critical value
♦ τ = 0.17
♦ Yields 1751 genes as “significant”
♦ Less than half confirmed on the test set
Adjusted critical value
♦ τ = 0.354
♦ 51 genes significant
♦ 90% of these are confirmed on the test set
[Figure: cumulative distribution of the max statistic over permutations; x-axis: Max(τ), from 0.24 to 0.4; y-axis: Cumulative Proportion, from 0 to 1]
From the cumulative distribution, we observe that τ = 0.354 corresponds to p = 0.05.
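The max-statistic permutation procedure can be sketched in C as below. This is an illustrative sketch, not the code used for the results above: the names `max_abs_tau` and `max_stat_p`, the samples-by-genes row layout, and the use of `rand()` are assumptions, and the embedded Kendall's Tau treats only exact ties as ties. Under this convention the Gene 9 column of the small example table gives |τ| = 19/45 ≈ 0.42 against the listed class labels.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Kendall's Tau normalized by all n(n-1)/2 comparisons; tied pairs score 0. */
static double kendall_tau(const double *a, const double *b, size_t n)
{
    double total = 0.0, compare = 0.0;
    size_t i, j;

    for (i = 0; i < n; ++i)
        for (j = i + 1; j < n; ++j) {
            compare += 1.0;
            if (a[i] == a[j] || b[i] == b[j]) continue;
            if ((a[i] > a[j]) == (b[i] > b[j])) total += 1.0;
            else total -= 1.0;
        }
    return compare == 0.0 ? 0.0 : total / compare;
}

/* Max over genes of |tau| between each gene's expression and the classes.
   expr is stored sample by sample (assumed layout: expr[s * n_genes + g]). */
double max_abs_tau(const double *expr, size_t n_samples, size_t n_genes,
                   const double *cls)
{
    double best = 0.0, *col;
    size_t g, s;

    col = malloc(n_samples * sizeof *col);
    assert(col != NULL);
    for (g = 0; g < n_genes; ++g) {
        double t;
        for (s = 0; s < n_samples; ++s)
            col[s] = expr[s * n_genes + g];
        t = kendall_tau(col, cls, n_samples);
        if (t < 0.0) t = -t;
        if (t > best) best = t;
    }
    free(col);
    return best;
}

/* Fraction of class-label permutations whose max statistic meets or
   exceeds the observed max statistic. */
double max_stat_p(const double *expr, size_t n_samples, size_t n_genes,
                  const double *cls, size_t n_perm)
{
    double observed, *work;
    size_t i, k, nbetter = 0;

    assert(n_samples >= 2 && n_perm > 0);
    observed = max_abs_tau(expr, n_samples, n_genes, cls);
    work = malloc(n_samples * sizeof *work);
    assert(work != NULL);
    for (i = 0; i < n_samples; ++i)
        work[i] = cls[i];
    for (k = 0; k < n_perm; ++k) {
        for (i = n_samples - 1; i > 0; --i) {   /* Fisher-Yates shuffle */
            size_t j = (size_t)rand() % (i + 1);
            double t = work[i]; work[i] = work[j]; work[j] = t;
        }
        if (max_abs_tau(expr, n_samples, n_genes, work) >= observed)
            ++nbetter;
    }
    free(work);
    return (double)nbetter / (double)n_perm;
}
```

Collecting the permuted max statistics and reading off their 95th percentile would give an adjusted critical value in the same spirit as the one quoted above.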
We get similar results using the T test
Unadjusted critical value ♦ t = 2.03 ♦ Yields 1636 genes as “significant” ♦ Less than half confirmed on the test set
Adjusted critical value ♦ t = 5.16 ♦ 40 genes significant ♦ 80% of these are confirmed on the test set
Is it safe to conclude anything about more than just the gene with the max statistic? ♦ Yes. ♦ If we were to generate the null distribution of the mth best gene, the 95th percentile would be lower than our initial critical value.
Is this estimate better than Bonferroni? ♦ It can be. ♦ If there are strong cross-correlations in the data, this procedure is not penalized by the redundancy. ♦ The Bonferroni correction makes the implicit assumption that all variables are independent.
CGH Analysis: Visualization and Correlation with Outcome
Data (J. Gray, K. Chin)
♦ 60 CGH profiles • 1225 “observables” • 52 tumor profiles • 8 normal profiles
♦ Patient information • Age of onset • Overall survival • Disease free survival • Alive or dead
♦ Tumor status • Estrogen receptor • Progesterone receptor • Size/Stage • p53
Is there a statistically significant correlation between CGH profile similarity and outcome (e.g. survival)?
Are there relationships among the measured variables?
[Figure: Tumor and Normal CGH Profiles; x-axis: Genomic Position (chromosomes 1 to 22, X); y-axis: Log(Relative copy number), from -0.4 to 0.4]
We can visualize complex profile data using 3D virtual worlds
[Figure: 3D view of the CGH profiles; axes: Survival, Genome location, Log(tumor/normal); profiles marked Alive or Dead]
By sliding the opaque XZ plane, we can select peaks above background
Normals shown in white at survival = -1 month. One remaining background peak from normals.
One particular locus sticks out CHR 9
♦ The center of this valley is on chromosome 9 ♦ The normal profiles show a slight depression there as well ♦ Is this locus significant?
Bad news: the correlations appear to be no better than chance at p = 0.05
We compute the direct correlation for each of 1225 loci
♦ Strongest correlation at 8q24
♦ Many other peaks
Compute level of significance using permutation analysis
♦ We get a critical value of 0.36
[Figure: Correlation magnitude with overall survival; x-axis: Genomic Position (chromosomes 1 to 22, X); y-axis: Correlation Magnitude, from 0 to 0.35]
[Figure: Cumulative Histogram of Correlation Magnitudes; x-axis: Correlation Magnitude, from 0.15 to 0.5; y-axis: Cumulative Proportion, from 0 to 1; annotation: p = 0.05 threshold is 0.36]
General Principle: Reduce the number of observations
Any method we can use to subselect a smaller set of observations from the larger set helps us, provided: ♦ The subselection method must be orthogonal to the correlation being studied • If we’re trying to link copy number to survival, we can’t systematically employ the survival outcomes in making our subselection
♦ Ideally, the method should have some compelling intuitive support based on the data ♦ Restricting observations based on frequency/magnitude is a generally useful technique: it tends to eliminate noise
By including frequency and amplitude, we can detect weaker correlations
The magnitude of copy number variation is not uniformly distributed
♦ 9q13 has the largest cumulative variation
♦ 8q24 has the next highest
Significance thresholds on correlation vary with “energy”
♦ Energy 0.0, t = 0.36
♦ Energy 3.0, t = 0.31
♦ Energy 6.0, t = 0.19
[Figure: CGH variation energy; x-axis: Genomic Position (chromosomes 1 to 22, X); y-axis: Energy, from 0 to 7]
[Figure: Cumulative Histogram of Correlation Magnitudes with Multiple Energies; x-axis: Correlation Magnitude, from 0 to 0.5; y-axis: Cumulative Proportion, from 0 to 1; annotations: E = 3.0, p = 0.05 threshold is 0.31; E = 6.0, p = 0.05 threshold is 0.19]
Both 8q24 and 9q13 are significantly correlated with survival
[Figure: Correlation Magnitude with Overall Survival; x-axis: Genomic Position (chromosomes 1 to 22, X); y-axis: Correlation Magnitude, from 0 to 0.35]
Amplification at 8q24: poorer survival (p < 0.01)
[Figure: Kaplan-Meier Plot of Normal vs Amplified at 8q24; x-axis: Survival Duration, from 0 to 160; y-axis: Fraction Surviving, from 0 to 1; curves: Normal, Amplified]
Deletion at 9q13: poorer survival (p < 0.01)
[Figure: Kaplan-Meier Plot of Normal vs Deleted at 9q13; x-axis: Survival Duration, from 0 to 160; y-axis: Fraction Surviving, from 0 to 1; curves: Normal, Deleted]
Clustering based on chromosomes 8 and 9 reveals patterns of survival and tumor phenotype
Cluster profiles based on Chr 8,9 ♦ Display raw data ♦ Display survival, p53 status
Cluster enrichment is statistically significant
♦ Orange block • Surv < 35 months • p53 often mutant
♦ Yellow block • Surv > 75 months • p53 often wt
[Figure: hierarchical clustering dendrogram of the CGH profiles on chr8-9 (tumor samples labeled _mt or _wt, normals labeled NOR), with side bars for p53 status (green = mut, black = wt) and survival (black = low, green = high)]
Deletion at 5q11-31 and amplification at 8q24 are correlated with mutant p53
[Figure: Correlation magnitude with p53 status; x-axis: Genomic Position (chromosomes 1 to 22, X); y-axis: Correlation Magnitude, from 0 to 0.35]
Some genes on 5q: APC and IL3
Conclusions on permutation and resampling methods
Permutation and resampling methods offer a means to replace complex assumptions with counting. We can generalize the concept of a statistic to any computable value and apply permutation methods to judge significance. This can be directly applied in addressing the problem of multiple testing in array-based data. If we can reduce the number of tests based on an orthogonal observation, we gain statistical power.
Further reading
♦ Resampling-Based Multiple Testing: Examples and Methods for P-Value Adjustment by Peter H. Westfall, S. Stanley Young
♦ Jain AN, Chin K, Borresen-Dale AL, Erikstein BK, Eynstein Lonning P, Kaaresen R, Gray JW. Quantitative analysis of chromosomal CGH in human breast tumors associates copy number abnormalities with p53 status and patient survival. Proc Natl Acad Sci U S A. 2001 Jul 3;98(14):7952-7.
♦ Dudoit, S., Yang, Y.H., Callow, M.J., and Speed, T. (2000) Statistical methods for identifying differentially expressed genes. Unpublished (Berkeley Stat Dept. Technical Report #578). (To appear: JASA)
♦ Tusher V, Tibshirani R, and Chu G. (2001) Significance analysis of microarrays applied to the ionizing radiation response. PNAS 98: 5116-5124.
So how do we design array-based experiments?
General scheme ♦ Use P samples to screen large number of variables (N) to select a much smaller number (M) • Expectation, despite multiple comparisons, is that the highest ranked variables contain true effects if they exist • We must pick M such that for a particular effect size, it is very likely that our M will include a true effect of the specified size
♦ On K new samples, screen the M variables in order to identify the true effects with reasonable power
So how do we pick P, N, M, and K? ♦ Pick N based on experimental considerations: What pool of variables do you need to consider? ♦ Pick P based on practical considerations: you probably won’t be able to pick P large enough to get adequate power. ♦ Pick M such that, with preliminary data, the null distribution of the Mth strongest effect makes it very likely that if an effect of the size you want exists, it will be within the top M. ♦ Now choose K such that with M variables, you have adequate power to see an effect of the size you want to find.
So how do you choose an effect size? ♦ Based on what is of practical significance ♦ Note: you can play with the effect size to modulate your power. This is a nasty business though.
Homework: Due November 6th
Problem 1: You wish to design an experiment that will identify a gene that is highly expressed in cervical scrapes from women with cervical disease but that is poorly expressed in samples from women free of disease. A preliminary experiment gives you 40,000 mean gene expression values from normal cervical samples measured versus a pooled reference RNA of which you have a large quantity. These values are G1…G40,000. For your initial experiment, to prove feasibility, you have 5 disease samples and 5 normal samples. You will measure gene expression in these versus the same pooled reference as above. This will yield $D^{1 \ldots 5}_{1 \ldots 40{,}000}$ and $F^{1 \ldots 5}_{1 \ldots 40{,}000}$, corresponding to the gene expression values for diseased and disease-free samples.
Mathematically define the characteristics of a gene that would serve as a good disease marker. Given that we have 40,000 variables and just 10 samples, is there a possibility that you will find a statistically defensible result to support your grant? Construct an example of data values to defend your answer. How can you use permutation analysis to quantify the significance of your best nominal gene?
Homework: Due November 6th
Problem 2: You take the best nominal gene from Problem 1 (significant or not) and are to design a follow-up experiment to test its utility as a marker. This is not a clinical trial design, but rather a step toward that. You wish only to confirm that expression is higher in the disease samples than in the normal samples. Suppose that for the disease samples, Gene X had mean expression of 7.0 and sample standard deviation 1.2. For the normal samples, Gene X had mean expression 5.3 with SD 1.1. Assume that you will use equal numbers of disease and normal samples in this follow-up experiment.
How many new samples of cancer and normal do you estimate you will need in order to obtain a result suggesting that the gene expression difference is significant at p = 0.05? (Hint: You may need to use the table of the distribution of t from Lecture I.) If you used a parametric statistic for above, is there a way to reduce the number of samples required by instead using a nonparametric test? Please make a statistical or probabilistic argument to support your answer. Suppose you decide to follow up on the best 100 genes from your initial experiment and you want to see if any of them are significant. Does this affect your sample size calculation? How can you use permutation analysis on your preliminary data to make a good estimate of how many samples you would need?
Problem 3: Is there any theoretical difficulty with performing an experiment and then testing a large number of different statistics and picking the one that suggests a significant result?
Problem 4: Suppose you have a pathological case for Pearson’s r: 9 points where X and Y are chosen from standard normal distributions and one outlier point at (1000,1000). Quite often, you will observe an anomalously high r, which will appear significant according to statistical theory. ♦ Do you expect that a permutation-based method will give you a more pessimistic estimate of significance in this case? ♦ Do you expect a significant difference between a pure permutation approach and a resampling approach (with replacement)?
Outline
The Elements of Statistical Learning: Data Mining, Inference, and Prediction by T. Hastie, R. Tibshirani, J. Friedman (2001). New York: Springer www.springer-ny.com
Read Chapters 1-2
Linear Regression [Chapter 3]
Model Assessment and Selection [Chapter 7]
Additive Models and Trees [Chapter 9]
Classification [Chapters 4, 12]
Note: NOT data analysis class; see www.biostat.ucsf.edu/services.html
[Figure: PCR Calibration. Log Copy Number versus Cycle Threshold.]
Example: Calibrating PCR
x: number of PCR cycles to achieve threshold
y: gene expression (log copy number)
Typical output from statistics package:
Residual SE = 0.3412, R-Square = 0.98, F-statistic = 343.7196 on 1 and 5 df

          coef      std.err   t.stat     p.val
Intcpt    10.1315   0.3914    25.8858    0
x         -0.2430   0.0131    -18.5397   0

Fitted regression line: y = 10.1315 − 0.2430x
All very nice and easy.
Example: Calibrating PCR continued
In reality, things are usually more complicated:
Have calibrations corresponding to several genes and/or plates ⇒ how to synthesize?
Potentially have failed / flawed experiments ⇒ how to handle?
Often have measures on additional covariates (e.g., temperature) ⇒ how to accommodate?
Can have non-linear relationships (e.g., copy number itself), non-constant error variances (e.g., greater variability at higher cycles), and non-independent data (e.g., duplicates) ⇒ how to generalize?
[Figure: PCR Calibration for five genes (plotting symbols 1 to 5: IL14, NKCC1, stat-1b, chymase, IL9RA). Log Copy Number versus Cycle Threshold.]
Simple Linear Regression
Data: $\{(x_i, y_i)\}_{i=1}^{N}$, a sample of N covariate, outcome pairs.
$x_i$: explanatory / feature variable; covariate; input
$y_i$: response variable; outcome; output
Linear Model: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
Errors: $\varepsilon$ has mean zero, constant variance $\sigma^2$
Coefficients: $\beta_0$ is the intercept; $\beta_1$ is the slope
Estimation: least squares, i.e., select coefficients to minimize the Residual Sum of Squares (RSS)
$$\mathrm{RSS}(\beta_0, \beta_1) = \sum_{i=1}^{N} \bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr)^2$$
Solution, Assumptions: exercise
Multiple Linear Regression (Secn 3.2)
Data: $\{(x_i, y_i)\}_{i=1}^{N}$ where now each $x_i$ is a covariate vector $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})^T$
Linear Model:
$$y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i, \qquad y = X\beta + \varepsilon$$
where $X$ is the $N \times (p+1)$ matrix with rows $(1, x_i^T)$; $y = (y_1, y_2, \ldots, y_N)^T$; $\varepsilon = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_N)^T$
Estimation of $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^T$: minimize
$$\mathrm{RSS}(\beta) = (y - X\beta)^T (y - X\beta)$$
Least Squares Solution: $\hat\beta = (X^T X)^{-1} X^T y$
Issues: inference, model selection, prediction, interpretation, assumptions/diagnostics, ...
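The least squares solution above can be sketched numerically. This is a minimal illustration, not the lecture's own code: the function name `least_squares` and the demo data (a noise-free line chosen to resemble the PCR calibration) are assumptions, and NumPy's `lstsq` is used in place of an explicit matrix inverse.

```python
import numpy as np

def least_squares(x, y):
    """Return beta_hat for y = X beta + eps, where X gets a leading column of ones."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    X = np.column_stack([np.ones(len(x)), x])   # N x (p+1), rows (1, x_i)
    # lstsq minimizes (y - X beta)^T (y - X beta); it is numerically safer
    # than forming (X^T X)^{-1} explicitly.
    beta_hat, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    return beta_hat

# Hypothetical noise-free calibration data on the line y = 10 - 0.25 x.
x_demo = np.array([15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0])
y_demo = 10.0 - 0.25 * x_demo
beta_demo = least_squares(x_demo, y_demo)       # intercept, slope
```

With noise-free input the fit recovers the generating intercept and slope; with real calibration data the same call returns the least squares estimates.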
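Inference
Sampling Variability: Assume $y_i$ uncorrelated, constant variance $\sigma^2$.
Covariance matrix of $\hat\beta$ is $\mathrm{Var}(\hat\beta) = (X^T X)^{-1} \sigma^2$.
Unbiased estimate of $\sigma^2$ is
$$\hat\sigma^2 = \frac{1}{N - p - 1} \sum_{i=1}^{N} (y_i - \hat y_i)^2$$
where $(\hat y_i) = \hat y = X\hat\beta = X (X^T X)^{-1} X^T y = Hy$. $H$ projects $y$ onto $\hat y$ in the column space of $X$.
Tests, Intervals: Now assume $\varepsilon_i$ independent, identically distributed $N(0, \sigma^2)$ (i.e., $\varepsilon \sim N(0, \sigma^2 I)$). Then
$$\hat\beta \sim N\bigl(\beta, (X^T X)^{-1} \sigma^2\bigr), \qquad (N - p - 1)\,\hat\sigma^2 \sim \sigma^2 \chi^2_{N-p-1}$$
To test if the jth coefficient $\beta_j = 0$ use
$$z_j = \frac{\hat\beta_j}{\hat\sigma \sqrt{v_j}}$$
where $v_j$ is the jth diagonal element of $(X^T X)^{-1}$.
To simultaneously test sets of coefficients (e.g. related variables): $p_1 + 1$ terms in larger subset; $p_0 + 1$ in smaller subset. $\mathrm{RSS}_1$ ($\mathrm{RSS}_0$): residual sum of squares for large (small) model. Use F statistic:
$$F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS}_1)/(p_1 - p_0)}{\mathrm{RSS}_1/(N - p_1 - 1)}$$
which (if small model correct) $\sim F_{p_1 - p_0,\; N - p_1 - 1}$.
$1 - 2\alpha$ confidence interval for $\beta_j$:
$$\bigl(\hat\beta_j - z_{1-\alpha}\, v_j^{1/2}\, \hat\sigma,\;\; \hat\beta_j + z_{1-\alpha}\, v_j^{1/2}\, \hat\sigma\bigr)$$
where $z_{1-\alpha}$ is the $1 - \alpha$ percentile of the normal dn.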
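A short numerical sketch of the test statistics on this slide, under stated assumptions: the function names `ols_inference` and `f_statistic` and the simulated data are illustrative only. It also shows a standard identity: when the smaller model drops a single coefficient, the F statistic equals the square of that coefficient's z-score.

```python
import numpy as np

def ols_inference(X, y):
    """X must already contain the column of ones. Returns beta_hat, sigma2_hat, z."""
    N, k = X.shape                                  # k = p + 1
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (N - k)            # divide by N - p - 1
    z = beta_hat / np.sqrt(sigma2_hat * np.diag(XtX_inv))
    return beta_hat, sigma2_hat, z

def f_statistic(X_small, X_large, y):
    """F statistic for nested models; columns of X_small are a subset of X_large."""
    def rss(X):
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        r = y - X @ b
        return r @ r
    N, k1 = X_large.shape                           # k1 = p1 + 1
    k0 = X_small.shape[1]                           # k0 = p0 + 1
    rss0, rss1 = rss(X_small), rss(X_large)
    return ((rss0 - rss1) / (k1 - k0)) / (rss1 / (N - k1))

# Simulated data (an assumption): y depends on x1 but not on x2.
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=50), rng.normal(size=50)
y = 1.0 + 2.0 * x1 + rng.normal(size=50)
X_large = np.column_stack([np.ones(50), x1, x2])
beta_hat, sigma2_hat, z = ols_inference(X_large, y)
F_drop_x2 = f_statistic(X_large[:, :2], X_large, y)   # test beta_2 = 0
```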
Variable Selection (Secn 3.4)
Objectives: improve prediction, interpretation. Achieved by eliminating/reducing role of lesser/redundant variables. Many different strategies/criteria.
Subset Selection: retain only a subset. Estimate coefs of retained variables with least squares as usual.
Best subset regression: for k = 1, ..., p find those k variables giving smallest RSS. Feasible for p up to about 30.
Forward stepwise selection: start with just intercept; sequentially add variables (one at a time) that most improve fit as measured by the F statistic for nested models. Stop when no variable significant at (say) 10% level.
Backward stepwise elimination: start with full model; sequentially delete variables that contribute least.
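Forward stepwise selection can be sketched as a greedy loop. This is an illustrative version only: the names `forward_stepwise` and `f_threshold` are hypothetical, and a fixed F-to-enter cutoff stands in for the 10% significance rule mentioned above (a proper rule would compare F with a quantile of the F distribution).

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of the least squares fit of y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ b
    return float(r @ r)

def forward_stepwise(x, y, f_threshold=4.0):
    """Greedy forward selection; returns column indices in order of entry."""
    N, p = x.shape
    chosen, remaining = [], list(range(p))
    X_cur = np.ones((N, 1))                         # start with just the intercept
    rss_cur = rss(X_cur, y)
    while remaining:
        # RSS obtained by adding each remaining candidate to the current model.
        trials = [(rss(np.column_stack([X_cur, x[:, j]]), y), j) for j in remaining]
        rss_new, j_best = min(trials)
        df_resid = N - (len(chosen) + 1) - 1        # N - p1 - 1 for the larger model
        F = (rss_cur - rss_new) / (rss_new / df_resid)
        if F < f_threshold:                         # no candidate improves the fit enough
            break
        chosen.append(j_best)
        remaining.remove(j_best)
        X_cur = np.column_stack([X_cur, x[:, j_best]])
        rss_cur = rss_new
    return chosen

# Simulated data (an assumption): only column 2 carries signal.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(100, 4))
y_demo = 3.0 * x_demo[:, 2] + 0.1 * rng.normal(size=100)
selected = forward_stepwise(x_demo, y_demo)
```

The strongly predictive column enters first; whether any noise column enters afterwards depends on the cutoff.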
Prostate Cancer Example
Subjects: 97 potential radical prostatectomy pts
Outcome: log prostate specific antigen (lpsa)
Covariates: log cancer volume (lcavol), log prostate weight (lweight), age, log amount of benign hyperplasia (lbph), seminal vesicle invasion (svi), Gleason score (gleason), log capsular penetration (lcp), percent Gleason scores 4 or 5 (pgg45).

Term        Value     StdError   tvalue    Pr(>|t|)
Intercept    0.6694   1.2964      0.5164   0.6069
lcavol       0.5870   0.0879      6.6768   0.0000
lweight      0.4545   0.1700      2.6731   0.0090
age         -0.0196   0.0112     -1.7576   0.0823
lbph         0.1071   0.0584      1.8316   0.0704
svi          0.7662   0.2443      3.1360   0.0023
lcp         -0.1055   0.0910     -1.1589   0.2496
gleason      0.0451   0.1575      0.2866   0.7751
pgg45        0.0045   0.0044      1.0236   0.3089
Prostate Cancer: Correlation Matrix

          lcv    lwt     age   lbh     svi     lcp     gle     pgg    lpsa
lcavol    1.00   0.194   0.2   0.027   0.54    0.675   0.432   0.43   0.7
lweight   0.19   1.000   0.3   0.435   0.11    0.100  -0.001   0.05   0.4
age       0.22   0.308   1.0   0.350   0.12    0.128   0.269   0.28   0.2
lbph      0.03   0.435   0.4   1.000  -0.09   -0.007   0.078   0.08   0.2
svi       0.54   0.109   0.1  -0.086   1.00    0.673   0.320   0.46   0.6
lcp       0.68   0.100   0.1  -0.007   0.67    1.000   0.515   0.63   0.5
gleason   0.43  -0.001   0.3   0.078   0.32    0.515   1.000   0.75   0.4
pgg45     0.43   0.051   0.3   0.078   0.46    0.632   0.752   1.00   0.4
lpsa      0.73   0.354   0.2   0.180   0.57    0.549   0.369   0.42   1.0
Prostate Cancer: Forward Stepwise Selection

Size   lcavol   lweight   age   lbph   svi   lcp   gleason   pgg45
1      T        F         F     F      F     F     F         F
2      T        T         F     F      F     F     F         F
3      T        T         F     F      T     F     F         F
4      T        T         F     T      T     F     F         F
5      T        T         T     T      T     F     F         F
6      T        T         T     T      T     F     F         T
7      T        T         T     T      T     T     F         T
8      T        T         T     T      T     T     T         T

Size                        1       2      3      4      5      6      7      8
Residual sum of squares     58.9    52.9   47.7   46.4   45.5   44.8   44.2   44.1
F-statistic for inclusion   111.2   10.5   10.0   2.5    1.9    1.3    1.3    0.1
[Figure: Prostate Cancer: Forward Stepwise Selection. Residual sum of squares versus size.]
[Figure: Prostate Cancer: Backward Stepwise Selection. Residual sum of squares versus size.]
Coefficient Shrinkage (Secn 3.4)
Selection procedures interpretable. But, due to in/out nature, variable ⇒ high prediction error. Shrinkage continuous: reduces prediction error.
Ridge Regression: shrinks coefs by penalizing size:
$$\hat\beta^{\mathrm{ridge}} = \arg\min_{\beta} \Bigl\{ \sum_{i=1}^{N} \Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \Bigr\}$$
Center X: $x_{ij} \leftarrow x_{ij} - \bar x_j$ (then $\hat\beta_0 = \bar y$); $X$ is $N \times p$.
Minimize $\mathrm{RSS}(\beta; \lambda) = (y - X\beta)^T (y - X\beta) + \lambda \beta^T \beta$
Solution: $\hat\beta^{\mathrm{ridge}} = (X^T X + \lambda I)^{-1} X^T y$
Now nonsingular even if $X^T X$ not full rank. Interpretation via SVD: pp 60 - 63. Choice of λ?? Microarray applications??
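The ridge solution can be sketched directly from the closed form. This is a minimal illustration with assumed names (`ridge`, `lam`) and toy data; it uses two identical covariate columns to show that the penalized system stays solvable when X^T X is singular.

```python
import numpy as np

def ridge(x, y, lam):
    """Ridge fit on centered covariates; returns (beta0_hat, beta_hat)."""
    Xc = x - x.mean(axis=0)                 # center X: x_ij <- x_ij - xbar_j
    beta0_hat = y.mean()                    # intercept estimate is ybar
    p = Xc.shape[1]
    # Solve (X^T X + lam I) beta = X^T y rather than inverting explicitly.
    beta_hat = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ (y - beta0_hat))
    return beta0_hat, beta_hat

# Toy data (an assumption): two identical columns, so X^T X has rank 1.
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x_demo = np.column_stack([a, a])
y_demo = 3.0 + 2.0 * a

b0_small, beta_small = ridge(x_demo, y_demo, lam=1e-6)   # almost no shrinkage
b0_big, beta_big = ridge(x_demo, y_demo, lam=10.0)       # visible shrinkage
```

With a tiny penalty the total slope of 2 is split evenly between the duplicated columns; a larger penalty shrinks both coefficients toward zero.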
Coefficient Shrinkage ctd (Secn 3.4)
The Lasso: like ridge but with L1 penalty:
$$\hat\beta^{\mathrm{lasso}} = \arg\min_{\beta} \Bigl\{ \sum_{i=1}^{N} \Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \Bigr\}$$
The L1 penalty makes the solution nonlinear in y ⇒ quadratic programming algorithm.
Why use? In this penalized form, a sufficiently large λ (equivalently, a small bound on $\sum_j |\beta_j|$) will cause some coefs to be exactly zero ⇒ synthesizes selection and shrinkage: interpretation and prediction error benefits.
Choice of λ?? Microarray applications??
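To see how an L1 penalty sets coefficients exactly to zero, here is a small sketch that solves the penalized criterion by cyclic coordinate descent with soft-thresholding. Note the swap: the slide mentions a quadratic programming algorithm, whereas coordinate descent is used here only because it is short; the names `lasso_cd` and `soft_threshold` and the simulated data are assumptions.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho toward zero by t, returning exactly 0 when |rho| <= t."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_cd(x, y, lam, n_sweeps=200):
    """Minimize sum (y_i - b0 - x_i^T b)^2 + lam * sum |b_j| over centered data."""
    Xc = x - x.mean(axis=0)
    yc = y - y.mean()
    p = Xc.shape[1]
    beta = np.zeros(p)
    col_ss = (Xc ** 2).sum(axis=0)                  # x_j^T x_j for each column
    for _ in range(n_sweeps):
        for j in range(p):
            # Residual with the contribution of coordinate j added back in.
            partial_resid = yc - Xc @ beta + Xc[:, j] * beta[j]
            rho = Xc[:, j] @ partial_resid
            beta[j] = soft_threshold(rho, lam / 2.0) / col_ss[j]
    return beta

# Simulated data (an assumption): only column 0 carries signal.
rng = np.random.default_rng(1)
x_demo = rng.normal(size=(60, 3))
y_demo = 2.0 * x_demo[:, 0] + 0.1 * rng.normal(size=60)

beta_mid = lasso_cd(x_demo, y_demo, lam=20.0)
# Above twice the largest |x_j^T y| every coefficient is zero.
Xc_demo = x_demo - x_demo.mean(axis=0)
lam_max = 2.0 * np.max(np.abs(Xc_demo.T @ (y_demo - y_demo.mean())))
beta_null = lasso_cd(x_demo, y_demo, lam=1.01 * lam_max)
```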
Model Assessment and Selection
Generalization performance of a model pertains to its predictive ability on independent test data.
Crucial for model choice and quality evaluation. These represent distinct goals: Model Selection: estimate the performance of a series of competing models in order to choose the best. Model Assessment: having chosen a best model, estimate its prediction error on new data.
Numerous criteria, strategies.
Bias, Variance, Complexity Secn 7.2
Outcome $Y$ (assume continuous); input vector $X$; prediction model $\hat f(X)$.
$L(Y, \hat f(X))$: loss function for measuring errors between $Y$ and $\hat f(X)$. Common choices are:
$$L(Y, \hat f(X)) = \begin{cases} (Y - \hat f(X))^2 & \text{squared error} \\ |Y - \hat f(X)| & \text{absolute error} \end{cases}$$
Test or generalization error: expected prediction error over independent test sample
$$\mathrm{Err} = E[L(Y, \hat f(X))]$$
where $X, Y$ drawn randomly from their joint distribution.
Training error: average loss over training sample:
$$\mathrm{err} = \frac{1}{N} \sum_{i=1}^{N} L(y_i, \hat f(x_i))$$
Bias, Variance, Complexity ctd
Typically, training error < test error because same data is being used for fitting and error assessment. Fitting methods usually adapt to training data, so err is an overly optimistic estimate of Err.
Part of discrepancy due to where evaluation points occur. To assess optimism use in-sample error:
$$\mathrm{Err}_{\mathrm{in}} = \frac{1}{N} \sum_{i=1}^{N} E_{Y^{\mathrm{new}}} \bigl[L(Y_i^{\mathrm{new}}, \hat f(x_i))\bigr]$$
Interest is in test or in-sample error of $\hat f$ ⇒ Optimal model minimizes these.
Assume $Y = f(X) + \varepsilon$, $E(\varepsilon) = 0$, $\mathrm{Var}(\varepsilon) = \sigma^2_\varepsilon$.
Bias, Variance, Complexity ctd
Expected prediction error of fit $\hat f(X)$ at input point $X = x_0$ under squared error loss:
$$\begin{aligned}
\mathrm{Err}(x_0) &= E[(Y - \hat f(x_0))^2 \mid X = x_0] \\
&= \sigma^2_\varepsilon + [E \hat f(x_0) - f(x_0)]^2 + E[\hat f(x_0) - E \hat f(x_0)]^2 \\
&= \sigma^2_\varepsilon + \mathrm{Bias}^2(\hat f(x_0)) + \mathrm{Var}(\hat f(x_0)) \\
&= \text{Irreducible Error} + \text{Bias}^2 + \text{Variance}.
\end{aligned}$$
First term: variance of the outcome around its true mean $f(x_0)$; unavoidable.
Second term: squared bias, the amount by which average of estimate $\hat f(x_0)$ differs from true mean.
Third term: variance, the expected squared deviation of estimate around its mean.
Bias, Variance, Complexity ctd
For a linear model LS fit $\hat f_p(x) = \hat\beta^T x$ we have
$$\mathrm{Err}(x_0) = E[(Y - \hat f_p(x_0))^2 \mid X = x_0] = \sigma^2_\varepsilon + [E \hat f_p(x_0) - f(x_0)]^2 + \|h(x_0)\|^2 \sigma^2_\varepsilon .$$
Here $h(x_0)$ is the weight vector producing the fit: $\hat f_p(x_0) = x_0^T (X^T X)^{-1} X^T y$.
So, $\mathrm{Var}[\hat f_p(x_0)] = \|h(x_0)\|^2 \sigma^2_\varepsilon$. While this variance changes with $x_0$, its average (over sample values $x_i$) is $(p/N)\sigma^2_\varepsilon$.
Hence, in-sample error is
$$\frac{1}{N} \sum_{i=1}^{N} \mathrm{Err}(x_i) = \sigma^2_\varepsilon + \frac{1}{N} \sum_{i=1}^{N} [E \hat f_p(x_i) - f(x_i)]^2 + \frac{p}{N} \sigma^2_\varepsilon .$$
Here model complexity directly related to the number of parameters p; will generalize later.
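The claim that the variance factor averages to p/N over the sample points can be checked numerically. This is a small sketch on a random design matrix (an assumption made for illustration); it relies on the hat matrix being symmetric and idempotent, so that the squared row norms sum to its trace.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 40, 5
X = rng.normal(size=(N, p))                  # hypothetical design with p columns

H = X @ np.linalg.inv(X.T @ X) @ X.T         # hat matrix; row i holds the weights h(x_i)
h_norm_sq = (H ** 2).sum(axis=1)             # ||h(x_i)||^2 at each training point
avg_variance_factor = h_norm_sq.mean()       # this multiplies sigma_eps^2
```

The average equals p/N up to rounding error, although individual values of the squared norm vary from point to point.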
Bias, Variance, Complexity ctd
Ridge regression has identical form for test error. But weights in variance term are different: $h(x_0) = x_0^T (X^T X + \lambda I)^{-1} X^T$. Bias also different.
Consider a linear model family (incl ridge regn): $\beta_*$ parameters of best fitting linear approx to $f$:
$$\beta_* = \arg\min_{\beta} E_X \bigl(f(X) - \beta^T X\bigr)^2 .$$
Squared bias is
$$[f(x_0) - E \hat f_\lambda(x_0)]^2 = [f(x_0) - \beta_*^T x_0]^2 + [\beta_*^T x_0 - E \hat\beta_\lambda^T x_0]^2 .$$
First term: model bias, the error between best fitting linear approx and true function.
Second term: estimation bias, the error between the average estimate ($E \hat\beta_\lambda^T x_0$) and best linear approx.
Bias, Variance, Complexity ctd
For linear models, fit by LS, estimation bias = 0. For restricted fits (e.g., ridge) it is positive – but have reduced variance.
Model bias can only be reduced by enlarging the class of linear models to a richer collection of models. Can be accomplished by inclusion of interaction terms or covariate transformations (e.g., SVMs, additive models –later).
Optimism of Training Error Secn 7.4
Training error typically less than true error. Define the optimism as $\mathrm{op} \equiv \mathrm{Err}_{\mathrm{in}} - E(\mathrm{err})$. For squared error and other loss functions have
$$\mathrm{op} = \frac{2}{N} \sum_{i=1}^{N} \mathrm{Cov}(\hat y_i, y_i)$$
⇒ the amount by which err underestimates the true error depends on how strongly $y_i$ affects its own prediction. The harder we fit the data, the greater $\mathrm{Cov}(\hat y_i, y_i)$, thus increasing the optimism.
If $\hat y_i$ is from a linear fit with p covariates, $\sum_{i=1}^{N} \mathrm{Cov}(\hat y_i, y_i) = p\sigma^2_\varepsilon$, so
$$\mathrm{Err}_{\mathrm{in}} = E(\mathrm{err}) + 2\,\frac{p}{N}\,\sigma^2_\varepsilon$$
Estimation of Prediction Error Secn 7.5
General form of in-sample estimates is
$$\widehat{\mathrm{Err}}_{\mathrm{in}} = \mathrm{err} + \widehat{\mathrm{op}} .$$
Applying to linear model with p parameters fit under squared error loss gives the $C_p$ statistic:
$$C_p = \mathrm{err} + 2\,\frac{p}{N}\,\hat\sigma^2_\varepsilon .$$
Here $\hat\sigma^2_\varepsilon$ is an estimate of the error variance obtained from a low-bias (large) model. Under this criterion we adjust the training error by a factor proportional to the number of covariates used.
Akaike Information Criterion is a generalization to situation where a log-likelihood loss function is used, e.g., binary, Poisson regression.
Criterion Selection Functions
Generic form for AIC is
$$\mathrm{AIC} = -2\,\mathrm{loglik} + 2p$$
Bayes information criterion (BIC) (Secn 7.7) is
$$\mathrm{BIC} = -2\,\mathrm{loglik} + (\log N)\,p$$
For $N > e^2 \approx 7.4$, BIC penalty > AIC penalty ⇒ BIC favors simpler models.
Many variants; new feature: adaptive penalties.
When log-lik based on normal distn we require an estimate for $\sigma^2_\varepsilon$. Typically obtained as mean squared error of low-bias model ⇒ problematic. Cross-validation does not require this.
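As a worked illustration, the criteria can be applied to the residual sums of squares reported for the prostate forward stepwise sequence (N = 97). Several choices here are assumptions rather than the lecture's analysis: p is counted as covariates plus intercept, the error variance is estimated from the full 8-covariate model and held fixed, and the log-likelihood is Gaussian.

```python
import math

def gaussian_loglik(rss, n, sigma2):
    """Gaussian log-likelihood with the error variance fixed at sigma2."""
    return -0.5 * n * math.log(2.0 * math.pi * sigma2) - rss / (2.0 * sigma2)

def cp(rss, n, p, sigma2):
    return rss / n + 2.0 * (p / n) * sigma2          # err + 2 (p/N) sigma^2

def aic(loglik, p):
    return -2.0 * loglik + 2.0 * p

def bic(loglik, p, n):
    return -2.0 * loglik + math.log(n) * p

n = 97
rss_by_size = [58.9, 52.9, 47.7, 46.4, 45.5, 44.8, 44.2, 44.1]   # sizes 1..8
sigma2 = rss_by_size[-1] / (n - 8 - 1)     # assumption: estimate from the full model

scores = {"Cp": [], "AIC": [], "BIC": []}
for size, rss in enumerate(rss_by_size, start=1):
    p = size + 1                           # assumption: covariates plus intercept
    ll = gaussian_loglik(rss, n, sigma2)
    scores["Cp"].append(cp(rss, n, p, sigma2))
    scores["AIC"].append(aic(ll, p))
    scores["BIC"].append(bic(ll, p, n))

# Model size minimizing each criterion.
best_size = {name: vals.index(min(vals)) + 1 for name, vals in scores.items()}
```

Under these assumptions Cp and AIC rank the models identically, and BIC, with its heavier penalty, settles on a smaller model.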
Effective Number of Parameters Secn 7.6
The Cp or AIC criteria have an optimism estimate (penalty) that involves number of parameters p.
If covariates are selected adaptively then no longer have $\sum_{i=1}^{N} \mathrm{Cov}(\hat y_i, y_i) = p\sigma^2_\varepsilon$; e.g., with a total of p covariates, if we select the best-fitting model with q < p covariates, optimism will exceed $(2q/N)\sigma^2_\varepsilon$. By choosing best-fitting model with q covariates, the effective number of parameters is > q.
Linear fitting methods: $\hat y = Sy$ where $S$ is an $N \times N$ matrix depending only on covariates $x_i$ (not $y_i$). Includes regression, methods using quadratic penalties such as ridge, cubic smoothing splines. Define enp as $d(S) = \mathrm{trace}(S)$.
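For a concrete linear fitting method, the effective number of parameters can be computed as the trace of the ridge smoother matrix. This is a sketch with assumed names (`ridge_smoother_trace`, `lam`) on a random design, ignoring the intercept for simplicity.

```python
import numpy as np

def ridge_smoother_trace(X, lam):
    """d(S) = trace(S) for the ridge smoother S = X (X^T X + lam I)^{-1} X^T."""
    p = X.shape[1]
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    return float(np.trace(S))

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 4))            # hypothetical design with 4 covariates

d_ls = ridge_smoother_trace(X_demo, 0.0)     # ordinary least squares
d_ridge = ridge_smoother_trace(X_demo, 10.0) # penalized fit
```

With no penalty the trace equals the number of covariates; increasing the penalty lowers the effective number of parameters below that count.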
Cross-Validation Secn 7.10
Simplest method for estimating prediction error. Estimates extra-sample error $\mathrm{Err} = E[L(Y, \hat f(X))]$.
With enough data (large N) set aside portion as validation set. Use to assess model performance.
Not feasible with small N ⇒ CV offers a finesse.
Randomly partition data into K equal-sized parts. For kth part, fit model to other K − 1 parts. Then calculate prediction error of resultant model when applied to kth part. Do this for k = 1, ..., K and combine the prediction error estimates.
Let $\kappa : \{1, \ldots, N\} \mapsto \{1, \ldots, K\}$ map observations to their assigned partition. Let $\hat f^{-k}(x)$ denote fitted function with kth part removed.
Cross-Validation ctd
Then CV prediction error estimate is
$$\mathrm{CV} = \frac{1}{N} \sum_{i=1}^{N} L\bigl(y_i, \hat f^{-\kappa(i)}(x_i)\bigr).$$
Given a set of models $f(x, \alpha)$ indexed by tuning parameter α (e.g., ridge, lasso, subset, spline) set
$$\mathrm{CV}(\alpha) = \frac{1}{N} \sum_{i=1}^{N} L\bigl(y_i, \hat f^{-\kappa(i)}(x_i, \alpha)\bigr).$$
Find $\hat\alpha$ minimizing CV(α) and fit chosen model $f(x, \hat\alpha)$ to all the data.
K = N: leave-one-out CV, approx unbiased for true prediction error but can be highly variable.
K = 5: lower variance but bias can be a problem. Generally K = 5 or 10 recommended but clearly depends on N ⇒ microarray applications??
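The K-fold procedure can be sketched with squared error loss and a generic fit/predict pair. The function `kfold_cv`, the two candidate models, and the simulated data are all illustrative assumptions; the fold assignment `kappa` plays the role of κ above.

```python
import numpy as np

def kfold_cv(x, y, fit, predict, K=5, seed=0):
    """Mean squared prediction error estimated by K-fold cross-validation."""
    N = len(y)
    rng = np.random.default_rng(seed)
    kappa = rng.permutation(N) % K             # kappa[i]: fold of observation i
    losses = np.empty(N)
    for k in range(K):
        held_out = kappa == k
        model = fit(x[~held_out], y[~held_out])            # fitted with kth part removed
        losses[held_out] = (y[held_out] - predict(model, x[held_out])) ** 2
    return losses.mean()

# Two candidate models: a constant and a straight line.
def fit_mean(x, y):
    return y.mean()

def predict_mean(m, x):
    return np.full(len(x), m)

def fit_line(x, y):
    return np.polyfit(x, y, 1)

def predict_line(coefs, x):
    return np.polyval(coefs, x)

# Simulated data (an assumption): a linear trend plus small noise.
rng = np.random.default_rng(1)
x_demo = rng.uniform(0.0, 1.0, size=60)
y_demo = 3.0 * x_demo + 0.1 * rng.normal(size=60)

cv_mean = kfold_cv(x_demo, y_demo, fit_mean, predict_mean)
cv_line = kfold_cv(x_demo, y_demo, fit_line, predict_line)
```

Choosing the model with the smaller CV value and refitting it to all the data mirrors the selection step described above.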
Gene Harvesting
Hastie, Tibshirani, Botstein, Brown (2001). genomebiology.com/2001/2/1/research
First cluster genes using hierarchical clustering. Obtain average expression profiles from all clusters. These serve as potential covariates, in addition to individual genes.
The use of clusters as covariates biases toward correlated sets of genes; reduces overfitting.
Forward stepwise algorithm; prescribed # terms. Provision for interactions with included terms. Model choice (# terms) via cross-validation.
[Figure: Hierarchical Clustering. Dendrograms of 11 items under Single Linkage and Average Linkage.]
Kappa Opioid / Harvesting / Average Linkage

Step   Node   Parent   Score   Size
1      6295   0        22.40   687
2      1380   6295     19.67   6
3      663    0        15.62   2
4      3374   663      10.69   3
5      1702   0        12.92   2
6      6268   663      11.27   83

$y = \beta_0 + \beta_1 \bar x_{\mathrm{Node6295}} + \beta_2 (\bar x_{\mathrm{Node1380}} \times \bar x_{\mathrm{Node6295}}) + \ldots$
Kappa Opioid / Harvesting / Single Linkage

Step   Node    Parent   Score   Size
1      g3655   0        21.97   1
2      2050    g3655    20.62   3
3      g900    g3655    16.91   1
4      g1324   g3655    16.01   1
5      g1105   g3655    24.34   1
6      g230    g3655    12.44   1

$y = \beta_0 + \beta_1 x_{\mathrm{Gene3655}} + \beta_2 (\bar x_{\mathrm{Node2050}} \times x_{\mathrm{Gene3655}}) + \ldots$
[Figure: Kappa Opioid: 5-fold CV Error Variance. Residual Variance versus Terms; curves: Clustered Genes, Original Genes, Training Error.]
[Figure: Gene Harvesting: Kappa-Opioid. Correlations: Node 6295.]
[Figure: Gene Harvesting: Kappa-Opioid. Scores: Node 6295; annotation: Node score = 22.4!]
[Figure: Kappa Opioid: 10-fold CV Error Variance. Residual Variance versus Terms; curves: Constrained Harvesting, Training Error.]
Smoothing
Recall simple linear model: $E(Y \mid X) = \beta_0 + \beta_1 X$
Dependence of E(Y) on X not necessarily linear. Can extend model by adding terms, e.g., $X^2$ ⇒ problematic: what terms? when to add?
What is desirable is to have
1. the data dictate appropriate functional form without imposing rigid parametric assumptions,
2. a corresponding automated fitting procedure.
Key concepts: locally determined fit. Issues: what is local? how to fit?
Resultant methods: (scatterplot) smoothers.
Resultant model: $E(Y \mid X) = \beta_0 + s(X; \lambda)$
[Figure: log(PSA) versus log(Capsular Penetration), with smooths at span = 10%, span = 25%, and span = 100%.]
Smoothing Splines
Avoid knot selection problem by regularization. For all fns f with two cts derivatives minimize
$$\mathrm{RSS}(f, \lambda) = \sum_{i=1}^{N} \{y_i - f(x_i)\}^2 + \lambda \int \{f''(t)\}^2 \, dt$$
First term measures closeness to data, second term penalizes curvature in f; λ governs the trade-off:
λ = 0 : f any interpolating function (very rough)
λ = ∞ : f simple least squares line fit (very smooth).
Soln: natural cubic spline with knots at unique $x_i$. Linear smoother: $\hat{\mathbf f} = (\hat f(x_i)) = S_\lambda y$. Calibrate smoothing parameter λ via $\mathrm{df}_\lambda = \mathrm{trace}(S_\lambda)$.
Pick λ by cross-validation; GCV.
Additive Models
Multiple linear regression: $E(Y \mid X_1, \ldots, X_p) = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p$
Additive model extension: $E(Y \mid X_1, \ldots, X_p) = \beta_0 + s_1(X_1) + \ldots + s_p(X_p)$
Estimation of $s_j$ via backfitting algorithm:
1. Initialize: $\hat\beta_0 = \frac{1}{N} \sum_{i=1}^{N} y_i$; $\hat s_j \equiv 0 \;\forall j$.
2. Cycle: $j = 1, 2, \ldots, p, \ldots, 1, 2, \ldots$
$$\hat s_j \leftarrow \mathrm{Smooth}_j \Bigl[ \bigl\{ y_i - \hat\beta_0 - \sum_{k \ne j} \hat s_k(x_{ik}) \bigr\}_1^N \Bigr]$$
until the $\hat s_j$ converge.
Same generalization (replacing linear predictor with sum of smooth functions) and backfitting method applies to binary, count outcomes.
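The backfitting cycle can be sketched with any scatterplot smoother. In this illustration a Gaussian kernel smoother stands in for the smoother (a swap made for brevity, not the lecture's choice), the recentering of each fitted function is an added convention to keep the intercept identifiable, and the names `kernel_smooth`, `backfit`, `bandwidth`, and `tol` are hypothetical.

```python
import numpy as np

def kernel_smooth(x, r, bandwidth=0.3):
    """Gaussian kernel smoother of partial residuals r against covariate x."""
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / bandwidth) ** 2)
    return (w @ r) / w.sum(axis=1)

def backfit(X, y, max_cycles=50, tol=1e-6):
    """Fit E(Y|X) = beta0 + sum_j s_j(X_j); returns beta0 and the fitted s_j(x_ij)."""
    N, p = X.shape
    beta0 = y.mean()                               # step 1: initialize
    s = np.zeros((N, p))                           # s[:, j] holds s_j at the data points
    for _ in range(max_cycles):                    # step 2: cycle over j
        s_old = s.copy()
        for j in range(p):
            # Partial residuals: remove the intercept and every term except s_j.
            partial = y - beta0 - s.sum(axis=1) + s[:, j]
            s[:, j] = kernel_smooth(X[:, j], partial)
            s[:, j] -= s[:, j].mean()              # recenter (added convention)
        if np.max(np.abs(s - s_old)) < tol:        # stop once the s_j settle
            break
    return beta0, s

# Simulated data (an assumption): one sinusoidal and one quadratic effect.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-2.0, 2.0, size=(200, 2))
y_demo = np.sin(X_demo[:, 0]) + X_demo[:, 1] ** 2 + 0.1 * rng.normal(size=200)

beta0_hat, s_hat = backfit(X_demo, y_demo)
fitted = beta0_hat + s_hat.sum(axis=1)
mse = np.mean((y_demo - fitted) ** 2)
```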
[Figure: Prostate Cancer: Additive Model Fits. Panels: s(lcavol) versus lcavol, s(lweight) versus lweight, s(age) versus age, s(lbph) versus lbph, s(lcp) versus lcp, s(pgg45) versus pgg45.]
Prostate Cancer: Additive Model

             Df   Npar Df   Npar F   Pr(F)
s(lcavol)    1    3         1.15     0.33
s(lweight)   1    3         1.65     0.18
s(lcp)       1    3         2.11     0.10
s(pgg45)     1    3         1.15     0.33

Initial Model: lpsa ~ s(lcavol) + s(lweight) + s(lcp) + s(pgg45)
Final Model: lpsa ~ lcavol + lweight + s(lcp) + s(pgg45)

Step   From            To              Df   Resid Df   AIC
1                                           80         57.5
2      s(lweight)      s(lweight, 2)   2    82         56.4
3      s(lcavol)       s(lcavol, 2)    2    84         55.6
4      s(lcavol, 2)    lcavol          1    85         55.3
5      s(lweight, 2)   lweight         1    86         55.3
Tree-Structured Regression Paradigm
Tree-based methods involve four components:
1. A set of questions (splits) phrased in terms of covariates that serve to partition the covariate space. A tree structure derives from recursive splitting and a binary tree results if the questions are yes/no. The subgroups created by assigning cases according to splits are termed nodes.
2. A split function $\phi(s, g)$ that can be evaluated for any split s of any node g. The split function is used to assess the worth of the competing splits.
3. A means for determining appropriate tree size.
4. Statistical summaries for the nodes of the tree.
Allowable Splits
An interpretable, flexible, feasible set of splits is obtained by constraining that
1. each split depends upon the value of only a single covariate,
2. for continuous or ordered categorical covariates, $X_j$, only splits resulting from questions of the form “Is $X_j \le c$?” for $c \in \mathrm{domain}(X_j)$ are considered; thus ordering is preserved,
3. for categorical covariates all possible splits into disjoint subsets of the categories are allowed.
Growing a Tree
1. Initialize: root node comprises the entire sample.
2. Recurse: for every terminal node, g, (a) examine all splits, s, on each covariate, (b) select and execute (create left, $g_L$, and right, $g_R$, daughter nodes) the best of these splits.
3. Stopping: grow large; prune back.
4. Selection: cross-validation, test sample.
Best split determined by split function $\phi(s, g)$.
$\bar y_g = (1/N_g) \sum_{i \in g} y_i$: outcome average for node g.
Within node sum-of-squares: $SS(g) = \sum_{i \in g} (y_i - \bar y_g)^2$.
Define $\phi(s, g) = SS(g) - SS(g_L) - SS(g_R)$.
Best split $s^*$ such that $\phi(s^*, g) = \max_s \phi(s, g)$. Easily computed via updating formulae.
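The split search for a single ordered covariate can be sketched as follows. This version recomputes the sums of squares for every candidate cut point instead of using the updating formulae, which is slower but easier to read; the function names, the "x <= c" split convention, and the toy data are assumptions.

```python
def sum_of_squares(ys):
    """Within-node sum of squares SS(g) around the node mean."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_split(x, y):
    """Return (c, phi) maximizing phi = SS(g) - SS(gL) - SS(gR) over splits 'x <= c'."""
    ss_parent = sum_of_squares(list(y))
    best_c, best_phi = None, float("-inf")
    # The largest observed value is skipped: it would leave the right node empty.
    for c in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= c]
        right = [yi for xi, yi in zip(x, y) if xi > c]
        phi = ss_parent - sum_of_squares(left) - sum_of_squares(right)
        if phi > best_phi:
            best_c, best_phi = c, phi
    return best_c, best_phi

# Toy data (an assumption): the outcome jumps between x = 3 and x = 4.
x_demo = [1, 2, 3, 4, 5, 6]
y_demo = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
c_star, phi_star = best_split(x_demo, y_demo)
```

Growing a tree repeats this search over every covariate at every terminal node and executes the split with the largest value of the split function.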
Prostate Cancer: Regression Tree
(each node: mean lpsa, n)

2.4780  n=97
  lcavol<2.46165: 2.1230  n=76
    lcavol<-0.478556: 0.6017  n=9
    lcavol>-0.478556: 2.3270  n=67
      lweight<3.68886: 2.0330  n=38
        pgg45<7.5: 1.7250  n=21
          lcavol<0.774462: 1.2630  n=8
          lcavol>0.774462: 2.0100  n=13
        pgg45>7.5: 2.4130  n=17
      lweight>3.68886: 2.7120  n=29
        lcavol<0.821736: 2.2880  n=10
        lcavol>0.821736: 2.9360  n=19
  lcavol>2.46165: 3.7650  n=21
    lcavol<2.79352: 3.2840  n=10
    lcavol>2.79352: 4.2030  n=11
[Figure: Prostate Cancer: Regression Tree. Relative Squared Error versus Number of Splits; curves: Cross-Validation, Training Error.]
Prostate Cancer: Pruned Regression Tree

2.4780  n=97
  lcavol<2.46165: 2.1230  n=76
    lcavol<-0.478556: 0.6017  n=9
    lcavol>-0.478556: 2.3270  n=67
  lcavol>2.46165: 3.7650  n=21
Peptide Binding: Background
Milik M, Sauer D, Brunmark AP et al., Nature Biotechnology, 16:753-6, 1998. Predict the amino acid sequences of peptides that bind to the particular MHC class I molecule, Kb.
The peptides of interest are 8-mers which may result from proteolysis of invading viral particles.
Some bind to class I MHC molecules. These complexes are presented on the infected cell surface, where they are recognized by cytotoxic T lymphocytes, which destroy the infected cell.
Hence, MHC binding is an essential prerequisite for any peptide to induce an immune response
⇒ the task of identifying peptides that bind to MHC molecules is immunologically important.
Peptide Binding: Problem
Studies have shown that binding peptides typically have specific amino acids at specific anchor positions.
Rules for predicting binding based solely on anchor position preferences, motifs, are inadequate.
Binding is also known to be influenced by (i) presence of secondary anchor positions, and (ii) between-position amino acid interactions.
It is the search for this more complex structure that constitutes the problem of interest.
Complex structure ⋈ Artificial Neural Networks.
[Figure: Amino acid frequencies at Position 1 through Position 8, Binders versus Non-Binders; one panel per position, amino acids (one-letter codes) on the horizontal axis.]
Peptide Binding: Data Structure, Issues
Binary outcome: Binding (yes/no). 8 unordered categorical covariates: the amino acids at the respective positions.
Highly polymorphic data: respectively 18, 20, 20, 20, 20, 20, 19, 20 distinct amino acids.
Key concerns: large number of corresponding indicator variables, between position interactions.
To avert related difficulties Milik et al. use select biophysical and biochemical properties of amino acids: adequacy? ⇒ potential information loss.
This structure is representative of a vast class of problems: Genotype ↦ Phenotype.
Peptide Binding: Regression Difficulties
Problems occur irrespective of outcome type. Regression modelling of binding: Default starting model includes each position. This entails estimating 149 coefficients; just assimilating the output will be difficult.
This for a simple model in a small (8-mer) setting. Adjacent and/or second nearest neighbor amino acids impact ability to bind to MHC: this suggests including third-order interactions.
But, problems even for second-order interactions: SAS, S-Plus break – lack of dynamic memory. Not remedied by expansion or forward selection.
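The coefficient count quoted above follows from the number of distinct amino acids at each position. The sketch below assumes reference-level indicator coding (one indicator per level beyond the first); the second-order count it produces is an illustration under that assumption, not a figure from the lecture.

```python
from itertools import combinations

# Distinct amino acids observed at positions 1 through 8 (from the slide).
levels = [18, 20, 20, 20, 20, 20, 19, 20]

# Main effects: one indicator per level beyond the reference level.
main_effects = sum(n - 1 for n in levels)

# Second-order interactions: products of indicators for every pair of positions.
second_order = sum((a - 1) * (b - 1) for a, b in combinations(levels, 2))
```

The main-effect count reproduces the 149 coefficients mentioned above, while all pairwise interactions would add several thousand further terms, which gives a sense of why standard regression software runs into memory limits.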
Full Tree // Training data
[Figure: full classification tree grown on the training data; each node is labelled with a class (0 or 1) and a count ratio. Top levels:]

1  92/223
  pos8:A,C,D,E,G,H,K,N,P,Q,R,S,T,V,W: 0  17/101
    pos1:A,C,D,E,F,G,H,I,K,L,N,P,R,V: 0  0/60
    pos1:Q,S,T,Y: 0  17/41
  pos8:F,I,L,M,Y: 1  8/122
    pos5:E,P,S,T,V: 0  3/10
    pos5:A,F,I,L,M,N,Y: 1  1/112

[Deeper splits use positions 5, 2, and 6.]
[Figure: Tree Deviance versus Tree Size // Test data. Deviance versus size.]
Predictions: test data

1  37/87
  pos8:A,C,D,E,G,H,K,N,P,Q,R,S,T,V,W: 0  7/37
    pos1:A,C,D,E,F,G,H,I,K,L,N,P,R,V: 0  1/23
    pos1:Q,S,T,Y: 0  6/14
  pos8:F,I,L,M,Y: 1  7/50
    pos5:E,P,S,T,V: 0  0/1
    pos5:A,F,I,L,M,N,Y: 1  2/44
Peptide Binding: Tree Attributes
Salient feature of trees re unordered categorical covariates (amino acids) is flexible (exhaustive) and automated handling of groups of levels: avoid computing/examining individual coefficients; covariate integrity preserved.
Interactions are readily accommodated. Easy interpretation/prediction via tree schematic.
Oft-cited deficiency of tree methods is piecewise constant response surfaces provide poor/inefficient approximations to smooth response surfaces: motivated MARS (HTF, Secn 9.4) modifications.
Here such concerns are moot. Notion of a smooth response surface requires ordered covariates – otherwise nothing to be smooth with respect to.