Multiple choice question for engineering
1. Which of the following is an incorrect statement?
a) In a phylogram, the branch lengths represent the amount of evolutionary divergence
b) Trees like cladogram are said to be scaled
c) The scaled trees have the advantage of showing both the evolutionary relationships and information about the relative divergence time of the branches
d) In a cladogram, the external taxa line up neatly in a row or column
Answer: b [Reason:] In a cladogram, the external taxa line up neatly in a row or column. Their branch lengths are not proportional to the number of evolutionary changes and thus have no phylogenetic meaning. In such an unscaled tree, only the topology of the tree matters, which shows the relative ordering of the taxa.
2. Which of the following is an incorrect statement about Newick Format?
a) It was designed to provide information of tree topology to computer programs without having to draw the tree itself
b) In this format, trees are represented by taxa excluded in nested parentheses
c) In this linear representation, each internal node is represented by a pair of parentheses
d) For a tree with scaled branch lengths, the branch lengths in arbitrary units are placed immediately after the name of the taxon separated by a colon
Answer: b [Reason:] Trees are represented by taxa included in nested parentheses. In this linear representation, each internal node is represented by a pair of parentheses that enclose all members of a monophyletic group, separated by commas.
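As a rough illustration of the nested-parentheses idea, the Python sketch below turns a small tree into a Newick-style string. The function name to_newick and the nested-tuple representation of the tree are assumptions made for this example only, and branch lengths are left out.

```python
def to_newick(node):
    """Return a Newick-style string for a tree given as nested tuples.

    A node is either a taxon name (str) or a tuple of child nodes.
    This representation is an assumption made for illustration.
    """
    if isinstance(node, str):
        return node  # external taxon: written as its name
    # internal node: one pair of parentheses around comma-separated children
    return "(" + ",".join(to_newick(child) for child in node) + ")"


# Four taxa grouped into two pairs
tree = (("A", "B"), ("C", "D"))
newick = to_newick(tree) + ";"  # a Newick string ends with a semicolon
print(newick)
```

For a tree with scaled branch lengths, each name would additionally be followed by a colon and a length, for example A:0.1.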
3. Sometimes a tree-building method may result in several equally optimal trees. A consensus tree can be built by showing the commonly resolved bifurcating portions and collapsing the ones that disagree among the trees, which results in a polytomy.
Answer: a [Reason:] Combining the nodes can be done either by strict consensus or by majority rule.
In a strict consensus tree, all conflicting nodes are collapsed into polytomies. In a consensus tree based on a majority rule, among the conflicting nodes, those that agree by more than 50% of the nodes are retained whereas the remaining nodes are collapsed into multifurcation.
4. The number of rooted trees (NR) for n taxa is ______
a) NR = (2n − 3)! / [2^(n+2) (n − 2)!]
b) NR = (2n − 3)! / [2^n (n − 2)!]
c) NR = (2n − 3)! / [2^(n−2) (n − 5)!]
d) NR = (2n − 3)! / [2^(n−2) (n − 2)!]
Answer: d [Reason:] The number of potential tree topologies can be enormously large even with a moderate number of taxa. The increase of possible tree topologies follows an exponential function. In this formula, (2n − 3)! is a factorial, which is the product of the positive integers from 1 to 2n − 3. For example, 5! = 1 × 2 × 3 × 4 × 5 = 120.
5. For unrooted trees, the number of unrooted tree topologies (NU) is ________
a) NU = (2n − 5)! / [2^(n−3) (n − 5)!]
b) NU = (2n − 5)! / [2^(n−3) (n − 3)!]
c) NU = (2n − 5)! / [2^(−2) (n − 3)!]
d) NU = (2n − 5)! / [2^n (n − 3)!]
Answer: b [Reason:] The number of possible topologies increases extremely rapidly with the number of taxa. For six taxa, there are 105 unrooted trees and 945 rooted trees. If there are ten taxa, there can be 2,027,025 unrooted trees and 34,459,425 rooted ones.
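The counts quoted above can be checked with a short Python sketch. It assumes the formulas NR = (2n − 3)! / [2^(n−2) (n − 2)!] for rooted trees and NU = (2n − 5)! / [2^(n−3) (n − 3)!] for unrooted trees; the function names are chosen for this example only.

```python
from math import factorial


def num_rooted(n):
    """Number of rooted bifurcating tree topologies for n taxa (n >= 2)."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def num_unrooted(n):
    """Number of unrooted bifurcating tree topologies for n taxa (n >= 3)."""
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


for n in (6, 10):
    print(n, num_unrooted(n), num_rooted(n))
# 6 105 945
# 10 2027025 34459425
```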
6. It can be computationally very demanding to find a true phylogenetic tree when the number of sequences is large.
Answer: a [Reason:] Because the number of rooted topologies is much larger than that for unrooted ones, the search for a true phylogenetic tree can be simplified by calculating the unrooted trees first. Once an optimal tree is found, rooting the tree can be performed by designating a number of taxa in the data set as an outgroup based on external information to produce a rooted tree.
7. Which of the following is an incorrect statement about Molecular Markers?
a) For studying very closely related organisms, protein sequences are preferred
b) The decision to use nucleotide or protein sequences depends on the purposes of the study
c) For constructing molecular phylogenetic trees, one can use either nucleotide or protein sequence data
d) The decision to use nucleotide or protein sequences depends on the properties of the sequences
Answer: a [Reason:] The choice of molecular markers is an important matter because it can make a major difference in obtaining a correct tree. For studying very closely related organisms, nucleotide sequences, which evolve more rapidly than proteins, can be used. For example, for evolutionary analysis of different individuals within a population, noncoding regions of mitochondrial DNA are often used.
8. For studying the evolution of ________ divergent groups of organisms, one may choose either ______ nucleotide sequences, such as ribosomal RNA or protein sequences.
a) less widely, slowly evolving
b) more widely, slowly evolving
c) more widely, rapidly evolving
d) less widely, rapidly evolving
Answer: b [Reason:] If the phylogenetic relationships to be delineated are at the deepest level, such as between bacteria and eukaryotes, using conserved protein sequences makes more sense than using nucleotide sequences.
9. In many cases, ______ sequences are preferable to ______ sequences because they are relatively ____ conserved.
a) protein, nucleotide, less
b) nucleotide, protein, less
c) protein, nucleotide, more
d) nucleotide, protein, more
Answer: c [Reason:] Protein sequences are preferable to nucleotide sequences because they are relatively more conserved. This is a result of the degeneracy of the genetic code, in which sixty-one codons encode twenty amino acids, so a change in a codon may not result in a change in the amino acid.
10. Protein sequences can remain the same while the corresponding DNA sequences have more room for variation.
Answer: a [Reason:] The protein sequences can remain the same while the corresponding DNA sequences have more room for variation, especially at the third codon position. The significant difference in evolutionary rates among the three nucleotide positions also violates one of the assumptions of tree-building. In contrast, the protein sequences do not suffer from this problem, even for divergent sequences.
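The point about third codon positions can be made concrete with a small Python sketch. It builds the standard genetic code table and translates two made-up DNA sequences that differ only at third positions; the sequences and the names translate, dna1, and dna2 are assumptions for illustration.

```python
# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop)
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: aa
    for (a, b, c), aa in zip(
        ((x, y, z) for x in BASES for y in BASES for z in BASES), AMINO
    )
}


def translate(dna):
    """Translate a DNA string codon by codon, ignoring a trailing partial codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


dna1 = "GCTGGACTT"  # hypothetical sequence
dna2 = "GCCGGGCTC"  # same codons, changed only at each third position

differences = sum(x != y for x, y in zip(dna1, dna2))
print(differences, translate(dna1), translate(dna2))  # 3 AGL AGL
```

Three nucleotide differences leave the encoded peptide unchanged, which is the sense in which the DNA has more room for variation than the protein.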
11. DNA sequences are sometimes more biased than protein sequences because of preferential codon usage in different organisms.
Answer: a [Reason:] In this case, different codons for the same amino acid are used at different frequencies, leading to sequence variations not attributable to evolution. In addition, the genetic code of mitochondria varies from the standard genetic code. Therefore, for comparison of mitochondrial protein-coding genes, it is necessary to translate the DNA sequences into protein sequences.
12. In the Jukes–Cantor model for correcting evolutionary distances, a formula for deriving evolutionary distances that include hidden changes is introduced by using a logarithmic function. It is ____
a) dAB = −(3/4) log[1 − (4/7)pAB].
b) dAB = −(3/4) ln[1 − (5/3)pAB].
c) dAB = −(3/4) log[1 − (4/3)pAB].
d) dAB = −(3/4) ln[1 − (4/3)pAB].
Answer: d [Reason:] The simplest nucleotide substitution model is the Jukes–Cantor model, which assumes that all nucleotides are substituted with equal probability. dAB is the evolutionary distance between sequences A and B, and pAB is the observed sequence distance measured by the proportion of substitutions over the entire length of the alignment.
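A minimal Python sketch of the Jukes–Cantor correction, dAB = −(3/4) ln[1 − (4/3) pAB], is given below. The function name and the input check are assumptions made for this example.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor distance from the observed proportion of differing sites p."""
    if not 0 <= p < 0.75:
        # the logarithm is undefined once p reaches 3/4 (saturation)
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - (4 / 3) * p)


# An observed distance of 0.30 is corrected upward to roughly 0.383
print(round(jukes_cantor(0.30), 3))
```

The corrected distance is always at least as large as the observed one, because hidden multiple substitutions are added back.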
1. GeneQuiz focuses on deriving a predicted protein function, based on a variety of available evidence, including the evaluation of the similarity to the closest homolog in a database.
Answer: a [Reason:] GeneQuiz is an integrated system for large-scale biological sequence analysis that uses a variety of search and analysis methods using current sequence databases. By applying expert rules to the results of the different methods, GeneQuiz creates a compact summary of findings.
2. Which of the given statements is incorrect regarding MAGPIE?
a) It analyzes the genome using a set of automated processes
b) It is designed for high-throughput genome sequence analysis
c) It is unable to locate potential promoters
d) It automatically annotates genomic sequence data and maintains a daily up-to-date record in response to user queries about one or more genomes
Answer: c [Reason:] The system also uses a set of rules in logic programming to make decisions that may be used to interpret information from various sources. It has been used to locate potential promoters, terminators, start codons, Shine-Dalgarno sites, DNA motif sites, co-transcription units, and putative operons in microbial genomes. These sites are shown on a map display of the genome that may be edited.
3. Which of the given statements is incorrect?
a) Paralogous sequences are frequently found to have dissimilar functions
b) An early classification scheme for eight related groups of E. coli genes included categories for enzymes, transport elements
c) An early classification scheme for eight related groups of E. coli genes included categories for regulators, membranes, structural elements, protein factors, leader peptides, and carriers
d) Ninety percent of E. coli genes related by significant sequence similarity fell into these same broad categories
Answer: a [Reason:] Genes that are significantly similar in an organism, i.e., paralogous sequences, frequently are found to have a related biological function. This discovery follows the expected origin of paralogs by gene duplication events, leaving one copy to perform the original function and producing a second copy to develop a new function not too distant from the original one under evolutionary selection.
4. The designation EC a.b.c.d conveys several pieces of information. Which of the following is not one of them?
a) One of twelve main classes of biochemical reactions
b) The group of substrate molecule
c) The nature of chemical bond that is involved in the reaction
d) Designation for acceptor molecules (cofactors)
Answer: a [Reason:] Option a should be ‘one of six main classes of biochemical reactions’. The Enzyme Commission numbers formulated by the Enzyme Commission of the International Union of Biochemistry and Molecular Biology provide a detailed way to classify enzymes based on the biochemical reactions they catalyze.
5. An approach to classification of genes that encode enzymes is to examine relationships among multiple enzymes that perform the same biochemical function in the same organism.
Answer: a [Reason:] Although catalyzing the same reaction, these enzymes showed variations in metabolic regulation of their activity. More than one-half of multiple enzymes in E. coli share significant sequence similarity; i.e., they are paralogs. However, the remainder do not share any sequence similarity.
6. Other functional classification schemes for genes include a broader category for genes involved in the same biological process, e.g., a three-group scheme for energy-related, information-related, and communication-related genes has also been used.
Answer: a [Reason:] By this scheme, plants devote more than one-half of their genome to energy metabolism, whereas animals devote one-half of their genome to communication-related functions.
7. Two species that have recently diverged from a common ancestor might be expected to have a ____ set of genes and _____ chromosomes with these genes positioned along the chromosomes in the same order.
a) distinct, similar
b) similar, distinct
c) similar, dissimilar
d) similar, similar
Answer: d [Reason:] Over evolutionary time, the sequence of each pair of genes will slowly diverge as the species diverge, and other changes such as gene duplication and gene loss change the gene content. In addition, the order of genes also changes over evolutionary time as a result of chromosomal rearrangements.
8. Which of the given statements is incorrect about the observations made with regard to gene order?
a) Order is highly conserved in closely related species
b) Order in closely related species becomes changed by rearrangements over evolutionary time
c) As more and more rearrangements occur, there will no longer be any correspondence in the order of orthologous genes on the chromosome of one organism with that of a second organism
d) Order is less conserved in closely related species
Answer: d [Reason:] Order is more conserved in closely related species. Another observation is that the groups of genes that have a similar biological function tend to remain localized in a group or cluster.
9. Which of the given statements is incorrect about chromosomal rearrangements?
a) Comparison of the number of rearrangements in a given period of evolutionary history may vary significantly from one organism to the next
b) If gene A has a neighboring gene B, then if an ortholog of A occurs in another genome, there is an increased probability of an ortholog of B also occurring in the other organism
c) If gene A has a neighboring gene B, then if an ortholog of A occurs in another genome, the B ortholog is more likely to be a neighbor of the A ortholog of the genome of the second species if the two species are more divergent
d) In general, the order of orthologs is not well conserved in prokaryotes when the genomes have diverged sufficiently that the orthologs have < 50% identity
Answer: c [Reason:] The B ortholog is less likely to be a neighbor of the A ortholog of the genome of the second species if the two species are more divergent. By classifying genes using a nine-class functional classification scheme, several genes falling into the same functional category are clustered together on the chromosomes of both of these organisms, and the clusters are in a similar order.
10. Which of the given statements is incorrect?
a) In a given organism or species, genes are found in a given order that is maintained on the chromosomes from one generation to the next
b) Genes with a related function are frequently found to be distorted on a chromosome
c) A possibility is that there is genetic variation (alleles) within each gene in a cluster of a given species and that only certain allelic combinations of different genes are compatible
d) Clustering of related genes presumably provides an evolutionary advantage to a species
Answer: b [Reason:] Genetic analysis has revealed that genes with a related function are frequently found to be clustered at one chromosomal location. As genome-by-genome comparisons of the chromosomes of related species are made and the rearrangements are discovered, a further challenge to computational and evolutionary biologists is to estimate the number and types of rearrangements that have occurred and also to determine when they occurred. For example, a comparison of the mouse and human chromosomes reveals many rearrangements.
1. A gene phylogeny only describes the evolution of a particular gene or encoded protein.
Answer: a [Reason:] One of the objectives of building phylogenetic trees based on molecular sequences is to reconstruct the evolutionary history of the species involved. However, strictly speaking, a gene phylogeny (phylogeny inferred from a gene or protein sequence) only describes the evolution of that particular gene or encoded protein.
2. Evolution of a particular sequence _______ correlate with the evolutionary path of the species.
a) does not
c) does not necessarily
Answer: c [Reason:] The sequence may evolve more or less rapidly than other genes in the genome or may have a different evolutionary history from the rest of the genome owing to horizontal gene transfer events. Thus, the evolution of a particular sequence does not necessarily correlate with the evolutionary path of the species.
3. The species evolution is the ______ of evolution by _____ in a genome.
a) combined result, multiple genes
b) result, single genes
c) result, sole genes
d) distinct results, single gene
Answer: a [Reason:] In a species tree, the branching point at an internal node represents the speciation event whereas, in a gene tree, the internal node indicates a gene duplication event. The two events may or may not coincide.
4. To obtain a species phylogeny, phylogenetic trees from a variety of gene families need to be constructed
Answer: a [Reason:] To obtain a species phylogeny, phylogenetic trees from a variety of gene families need to be constructed to give an overall assessment of the species evolution. Phylogenetic trees drawn as cladograms (top) and phylograms (bottom).
5. It is often desirable to define the root of a tree. There are two ways to define the root of a tree. One is to use an outgroup, which ______
a) is a sequence that is homologous to the sequences under consideration
b) is separated from those sequences at an early evolutionary time
c) is generally determined from independent sources of information
d) is generally determined from similar or related sources of information
Answer: d [Reason:] Options c and d contradict each other, and c is the correct description: an outgroup is generally determined from independent sources of information. For example, a bird sequence can be used as a root for the phylogenetic analysis of mammals based on multiple lines of evidence that indicate that birds branched off prior to all mammalian taxa in the ingroup. Outgroups are required to be distinct from the ingroup sequences, but not too distant from the ingroup.
6. Which of the following is an incorrect statement about the Kimura model?
a) It is a model to correct evolutionary distances and is a more sophisticated model
b) In this model, the mutation rates for transitions and transversions are assumed to be different
c) According to this model, transitions occur more frequently than transversions
d) According to this model, transversions occur more frequently than transitions
Answer: d [Reason:] In the Kimura model, transitions are assumed to occur more frequently than transversions, which provides a more realistic estimate of evolutionary distances. The Kimura model uses the following formula: dAB = −(1/2) ln(1 − 2pti − ptv) − (1/4) ln(1 − 2ptv), where dAB is the evolutionary distance between sequences A and B, pti is the observed frequency of transitions, and ptv is the observed frequency of transversions.
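The Kimura formula dAB = −(1/2) ln(1 − 2pti − ptv) − (1/4) ln(1 − 2ptv) can be sketched in Python as follows. The function name, the argument names, and the example frequencies are assumptions for illustration.

```python
import math


def kimura_distance(p_ti, p_tv):
    """Kimura two-parameter distance from observed transition and
    transversion frequencies (p_ti and p_tv)."""
    a = 1 - 2 * p_ti - p_tv
    b = 1 - 2 * p_tv
    if a <= 0 or b <= 0:
        # the sequences are too divergent for the correction to be defined
        raise ValueError("frequencies too large for the Kimura correction")
    return -0.5 * math.log(a) - 0.25 * math.log(b)


# Hypothetical alignment: 10% transitions, 5% transversions
print(round(kimura_distance(0.10, 0.05), 4))
```

With these inputs the observed distance of 0.15 is corrected to roughly 0.17.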
7. Which of the following is an incorrect statement about Choosing Substitution Models?
a) There is one substitution at a particular position, in divergent sequences
b) The evolutionary divergence is beyond the ability of the statistical models to correct
c) The statistical models used to correct homoplasy are called substitution models or evolutionary models
d) For constructing DNA phylogenies, there are nucleotide substitution models available
Answer: a [Reason:] The caveat of using these models is that if there are too many multiple substitutions at a particular position, which is often true for very divergent sequences, the position may become saturated. This means that the evolutionary divergence is beyond the ability of the statistical models to correct. In this case, true evolutionary distances cannot be derived. Therefore, only reasonably similar sequences are to be used in phylogenetic comparisons.
8. The second step in phylogenetic analysis is to construct sequence alignment. This is probably the most critical step in the procedure because it establishes positional correspondence in evolution.
Answer: a [Reason:] Incorrect alignment leads to systematic errors in the final tree or even a completely wrong tree. For that reason, it is essential that the sequences are correctly aligned. Multiple state-of-the-art alignment programs such as T-Coffee should be used. The alignment results from multiple sources should be inspected and compared carefully to identify the most reasonable one. Automatic sequence alignments almost always contain errors and should be further edited or refined if necessary.
1. Which of the following is untrue?
a) Eukaryotic nuclear genomes are much larger than prokaryotic ones
b) They tend to have a very high gene density
c) Eukaryotic nuclear genome sizes range from 10 Mbp to 670 Gbp (1 Gbp = 10^9 bp)
d) They tend to have a very low gene density
Answer: b [Reason:] In humans, for instance, only 3% of the genome codes for genes, with about 1 gene per 100 kbp on average. The space between genes is often very large and rich in repetitive sequences and transposable elements.
2. Which of the following is untrue about translation and transcription?
a) The first event is capping at the 5’ end of the transcript, which involves methylation at the initial residue of the RNA
b) The splicing process involves a large RNA-protein complex called spliceosome
c) The second event is splicing, which is the process of removing exons and joining introns
d) The second event is splicing, which is the process of removing introns and joining exons
Answer: c [Reason:] The reaction requires intermolecular interactions between a pair of nucleotides at each end of an intron and the RNA component of the spliceosome. To make the matter even more complex, some eukaryotic genes can have their transcripts spliced and joined in different ways to generate more than one transcript per gene. This is the phenomenon of alternative splicing.
3. The main issue in prediction of eukaryotic genes is the identification of exons, introns, and splicing sites.
Answer: a [Reason:] From a computational point of view, it is a very complex and challenging problem. Because of the presence of split gene structures, alternative splicing, and very low gene densities, the difficulty of finding genes in such an environment is likened to finding a needle in a haystack.
4. Most vertebrate genes use ____ as the translation start codon and have a uniquely conserved flanking sequence called a Kozak sequence (CCGCCATGG).
Answer: b [Reason:] In addition, most of these genes have a high density of CG dinucleotides near the transcription start site. This region is referred to as a CpG island (p refers to the phosphodiester bond connecting the two nucleotides), which helps to identify the transcription initiation site of a eukaryotic gene. The poly-A signal can also help locate the final coding sequence.
5. Which of the following is untrue about Ab Initio–Based Programs for Gene Prediction?
a) The goal of the ab initio gene prediction programs is to discriminate exons from noncoding sequences
b) The goal is joining exons together in the correct order
c) The main difficulty is correct identification of exons
d) To predict exons, the algorithms rely solely on gene signals
Answer: d [Reason:] To predict exons, the algorithms rely on two features, gene signals and gene content. Signals include gene start and stop sites and putative splice sites, recognizable consensus sequences such as poly-A sites.
6. In Ab Initio–Based Programs for Gene Prediction– Gene content refers to coding statistics, which includes nonrandom nucleotide distribution, amino acid distribution, synonymous codon usage, and hexamer frequencies.
Answer: a [Reason:] Among these features, the hexamer frequencies appear to be most discriminative for coding potentials. To derive an assessment for this feature, HMMs can be used, which require proper training. In addition to HMMs, neural network-based algorithms are also common in the gene prediction field.
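To make the notion of hexamer frequencies concrete, the Python sketch below counts overlapping six-nucleotide words in a sequence. It only shows how the raw statistic is gathered; it is not how any particular gene finder works, and the function name and the toy sequence are assumptions for this example.

```python
from collections import Counter


def hexamer_frequencies(seq):
    """Relative frequency of each overlapping hexamer in a DNA string."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 6] for i in range(len(seq) - 5))
    total = sum(counts.values())
    return {hexamer: n / total for hexamer, n in counts.items()}


freqs = hexamer_frequencies("ATGATGATG")  # toy sequence with 4 hexamer windows
print(freqs["ATGATG"])  # 0.5
```

In practice such frequencies would be estimated separately from known coding and noncoding sequences and then compared.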
7. Which of the following is untrue about Prediction Using Neural Networks for Gene Prediction?
a) A neural network is a statistical model with a special architecture for pattern recognition and classification
b) It is composed of a network of mathematical variables
c) They resemble ab initio approaches
d) The variables in neural networks resemble the biological nervous system, with variables or nodes connected by weighted functions that are analogous to synapses
Answer: c [Reason:] An aspect of the model that makes it look like a biological neural network is its ability to “learn” and then make predictions after being trained. The network is able to process information and modify parameters of the weight functions between variables during the training stage. Once it is trained, it is able to make automatic predictions about the unknown. This is quite different from the ab initio methods.
8. Which of the following is untrue about Prediction Using Neural Networks for Gene Prediction?
a) A neural network is constructed with multiple layers; the input, output, and hidden layers
b) The input is the gene sequence with intron and exon signals
c) The model is not fed with a sequence of known gene structure
d) The output is the probability of an exon structure
Answer: c [Reason:] Between input and output, there may be one or several hidden layers where the machine learning takes place. The machine learning process starts by feeding the model with a sequence of known gene structure. The gene structure information is separated into several classes of features such as hexamer frequencies, splice sites, and GC composition during training. The weight functions in the hidden layers are adjusted during this process to recognize the nucleotide patterns and their relationship with known structures.
9. GRAIL is a web-based program based on a neural network algorithm, which is trained on several statistical features such as splice junctions, start and stop codons, poly-A sites, promoters, and CpG islands.
Answer: a [Reason:] The program scans the query sequence with windows of variable lengths and scores for coding potentials and finally produces an output that is the result of exon candidates. The program is currently trained for human, mouse, Arabidopsis, Drosophila, and Escherichia coli sequences.
10. Which of the following is untrue about Prediction Using Discriminant Analysis for Gene Prediction?
a) QDA draws a curved line based on a quadratic function
b) LDA works by drawing a diagonal line that best separates coding signals from noncoding signals based on knowledge learned from training data sets of known gene structures
c) Some gene prediction algorithms rely on discriminant analysis, either LDA or quadratic discriminant analysis (QDA), to improve accuracy
d) LDA works by plotting a three-dimensional graph of coding signals versus all potential 3’ splice site positions
Answer: d [Reason:] QDA draws a curved line based on a quadratic function instead of drawing a straight line to separate coding and noncoding features. This strategy is designed to be more flexible and provide a more optimal separation between the data points.
1. Which of the following is a wrong statement?
a) Prokaryotes include bacteria and Archaea
b) Prokaryotes have relatively large genomes
c) Prokaryotes have relatively small genomes
d) In prokaryotes, the gene density in the genomes is high, with more than 90% of a genome sequence containing coding sequence
Answer: b [Reason:] Prokaryotes have relatively small genomes with sizes ranging from 0.5 to 10 Mbp (1 Mbp = 10^6 bp). Each prokaryotic gene is composed of a single contiguous stretch of ORF coding for a single protein or RNA with no interruptions within a gene.
2. In bacteria, the majority of genes have a start codon ATG (or AUG in mRNA; because prediction is done at the DNA level, T is used in place of U), which codes for methionine.
Answer: a [Reason:] Occasionally, GTG and TTG are used as alternative start codons. But methionine is still the actual amino acid inserted at the first position.
3. The presence of these codons at the beginning of the frame _____ give a clear indication of the translation initiation site.
b) does not necessarily
c) does not
Answer: b [Reason:] Because there may be multiple ATG, GTG, or TTG codons in a frame, the presence of these codons at the beginning of the frame does not necessarily give a clear indication of the translation initiation site. Instead, to help identify this initiation codon, other features associated with translation are used.
4. The Shine-Dalgarno sequence is a stretch of purine-rich sequence complementary to 16S rRNA in the ribosome.
Answer: a [Reason:] It is located immediately downstream of the transcription initiation site and slightly upstream of the translation start codon. In many bacteria, it has a consensus motif of AGGAGGT. Identification of the ribosome binding site can help locate the start codon.
5. There are ____ possible stop codons, identification of which is straightforward.
Answer: d [Reason:] At the end of the protein coding region is a stop codon that causes translation to stop. There are three possible stop codons, identification of which is straightforward. Many prokaryotic genes are transcribed together as one operon.
6. Which of the following is a wrong statement regarding the conventional determination of open reading frames?
a) Without the use of specialized programs, prokaryotic gene identification can rely on manual determination of ORFs and major signals related to prokaryotic genes
b) Prokaryotic DNA is first subject to conceptual translation in all six possible frames, two frames forward and four frames reverse
c) A stop codon occurs in about every twenty codons by chance in a noncoding region
d) Prokaryotic DNA is first subject to conceptual translation in all six possible frames, three frames forward and three frames reverse
Answer: b [Reason:] Prokaryotic DNA is first subject to conceptual translation in all six possible frames, three frames forward and three frames reverse. Because a stop codon occurs in about every twenty codons by chance in a noncoding region, a frame longer than thirty codons without interruption by stop codons is suggestive of a gene coding region, although the threshold for an ORF is normally set even higher at fifty or sixty codons.
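The six-frame scan described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: it reports any stop-free stretch of at least min_codons codons, does not require a start codon, and all function and variable names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # 3 of 64 codons, about 1 in 21 by chance
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def open_frames(dna, min_codons=50):
    """List (strand, frame, start, codons) for stop-free stretches found in
    the three forward and three reverse frames."""
    results = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            run_start, run_len = frame, 0
            for i in range(frame, len(seq) - 2, 3):
                if seq[i:i + 3] in STOP_CODONS:
                    if run_len >= min_codons:
                        results.append((strand, frame, run_start, run_len))
                    run_start, run_len = i + 3, 0  # restart after the stop
                else:
                    run_len += 1
            if run_len >= min_codons:  # stretch running to the sequence end
                results.append((strand, frame, run_start, run_len))
    return results


# Toy sequence: a start codon, 60 alanine codons, then a stop codon
toy = "ATG" + "GCT" * 60 + "TAA"
for hit in open_frames(toy):
    print(hit)
```

Start coordinates for the minus strand refer to positions on the reverse-complemented sequence. A fuller implementation would also look for a start codon and a ribosome binding site before accepting a frame.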
7. The putative ORF can be translated into a protein sequence, which is then used to search against a protein database.
Answer: a [Reason:] The putative frame is further manually confirmed by the presence of other signals such as a start codon and Shine-Dalgarno sequence. Detection of homologs from this search is probably the strongest indicator of a protein-coding frame.
8. Which of the following is a wrong statement regarding TESTCODE method?
a) This is based on the nucleotide composition of the third position of a codon
b) In practice, because genes can be in any of the six frames, the statistical patterns are computed for all possible frames
c) It is implemented in the commercial GCG package
d) It exploits the fact that the third codon nucleotides in a coding region fail to repeat themselves
Answer: d [Reason:] In a coding sequence, it has been observed that this position has a preference to use G or C over A or T. By plotting the GC composition at this position, regions with values significantly above the random level can be identified, which are indicative of the presence of ORFs. This method exploits the fact that the third codon nucleotides in a coding region tend to repeat themselves.
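The third-position GC statistic mentioned above can be computed with a few lines of Python. This sketch only measures GC content at the third codon position for one frame; it is not the TESTCODE algorithm itself, and the function name and toy sequences are assumptions.

```python
def gc3(seq, frame=0):
    """Fraction of G or C at the third codon position in the given frame."""
    thirds = seq.upper()[frame + 2::3]  # every third base, starting at codon position 3
    if not thirds:
        return 0.0
    return sum(base in "GC" for base in thirds) / len(thirds)


print(gc3("ATGGCCAAG"))  # third positions G, C, G -> 1.0
print(gc3("ATAGCTAAA"))  # third positions A, T, A -> 0.0
```

Computing this value in a sliding window for each of the six frames would give the kind of plot in which regions well above the random level stand out.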
9. Conventional methods for determining open reading frames identify only typical genes and tend to miss atypical genes in which the rule of codon bias is not strictly followed.
Answer: a [Reason:] These statistical methods, which are based on empirical rules, examine the statistics of a single nucleotide (either G or C). To improve the prediction accuracies, the new generation of prediction algorithms uses more sophisticated statistical models.
10. Which of the following is a wrong statement regarding Gene Prediction Using Markov Models and Hidden Markov Models?
a) Markov models and HMMs can be very helpful in providing finer statistical description of a gene
b) A Markov model describes the probability of the distribution of nucleotides in a DNA sequence
c) In a Markov model the conditional probability of a particular sequence position depends on k alternate positions
d) A zero-order Markov model assumes each base occurs independently with a given probability
Answer: c [Reason:] In a Markov model the conditional probability of a particular sequence position depends on k previous positions. In this case, k is the order of a Markov model. A zero-order Markov model assumes each base occurs independently with a given probability, which is often the case for noncoding sequences. A first-order Markov model assumes that the occurrence of a base depends on the base preceding it. A second-order model looks at the preceding two bases to determine which base follows, which is more characteristic of codons in a coding sequence.
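The idea of a k-th order Markov model can be sketched in Python by estimating, from a training sequence, the probability of each base given the k bases before it. The function name and the toy training sequence are assumptions for illustration; real gene finders train on many sequences with known gene locations.

```python
from collections import Counter, defaultdict


def train_markov(seq, k):
    """Estimate P(base | previous k bases) from a single training sequence."""
    counts = defaultdict(Counter)
    for i in range(k, len(seq)):
        context = seq[i - k:i]  # the k preceding bases ("" when k == 0)
        counts[context][seq[i]] += 1
    return {
        context: {base: n / sum(ctr.values()) for base, n in ctr.items()}
        for context, ctr in counts.items()
    }


zero_order = train_markov("ACACAC", 0)   # each base treated independently
first_order = train_markov("ACACAC", 1)  # each base depends on the one before

print(zero_order)   # {'': {'A': 0.5, 'C': 0.5}}
print(first_order)  # {'A': {'C': 1.0}, 'C': {'A': 1.0}}
```

The zero-order model sees only that A and C are equally common, while the first-order model captures that A is always followed by C, which shows how a higher order describes more of the non-randomness in a sequence.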
11. The use of Markov models in gene finding exploits the fact that oligonucleotide distributions in the coding regions are different from those for the noncoding regions.
Answer: a [Reason:] These can be represented with various orders of Markov models. Since a fixed-order Markov chain describes the probability of a particular nucleotide that depends on previous k nucleotides, the longer the oligomer unit, the more non-randomness can be described for the coding region. Therefore, the higher the order of a Markov model, the more accurately it can predict a gene.
12. Because a protein-encoding gene is composed of nucleotides in triplets as codons, more effective Markov models are built in sets of three nucleotides, describing nonrandom distributions of trimers or hexamers, and so on.
Answer: a [Reason:] The parameters of a Markov Model have to be trained using a set of sequences with known gene locations. Once the parameters of the model are established, it can be used to compute the nonrandom distributions of trimers or hexamers in a new sequence to find regions that are compatible with the statistical profiles in the learning set.