When a molecular biologist clones and sequences a DNA fragment, he or she will typically want to find out whether the sequence has been previously identified or whether related sequences are available. In addition, the investigator will wish to determine characteristics of the protein encoded by the gene.
In this exercise, you will use the following DNA sequence as the basis for a number of exercises related to database searches and analyses on the web. To start, use the mouse to select the text of the sequence. Once it is highlighted, press Ctrl + C to copy the text onto the clipboard. You will then be able to paste the sequence into various programs on the web.
1 cggcgccgcg agcttctcct ctcctcacga ccgaggcaga gcagtcatta tggcgaacct
61 tggctgctgg atgctggttc tctttgtggc cacatggagt gacctgggcc tctgcaagaa
121 gcgcccgaag cctggaggat ggaacactgg gggcagccga tacccggggc agggcagccc
181 tggaggcaac cgctacccac ctcagggcgg tggtggctgg gggcagcctc atggtggtgg
241 ctgggggcag cctcatggtg gtggctgggg gcagccccat ggtggtggct ggggacagcc
301 tcatggtggt ggctggggtc aaggaggtgg cacccacagt cagtggaaca agccgagtaa
361 gccaaaaacc aacatgaagc acatggctgg tgctgcagca gctggggcag tggtgggggg
421 ccttggcggc tacatgctgg gaagtgccat gagcaggccc atcatacatt tcggcagtga
481 ctatgaggac cgttactatc gtgaaaacat gcaccgttac cccaaccaag tgtactacag
541 gcccatggat gagtacagca accagaacaa ctttgtgcac gactgcgtca atatcacaat
601 caagcagcac acggtcacca caaccaccaa gggggagaac ttcaccgaga ccgacgttaa
661 gatgatggag cgcgtggttg agcagatgtg tatcacccag tacgagaggg aatctcaggc
721 ctattaccag agaggatcga gcatggtcct cttctcctct ccacctgtga tcctcctgat
781 ctctttcctc atcttcctga tagtgggatg aggaaggtct tcctgttttc accatctttc
841 taatcttttt ccagcttgag ggaggcggta tccacctgca gcccttttag tggtggtgtc
901 tcactctttc ttctctcttt gtcccggata ggctaatcaa tacccttggc actgatgggc
961 actggaaaac atagagtaga cctgagatgc tggtcaagcc ccctttgatt gagttcatca
1021 tgagccgttg ctaatgccag gccagtaaaa gtataacagc aaataaccat tggttaatct
1081 ggacttattt ttggacttag tgcaacaggt tgaggctaaa acaaatctca gaacagtctg
1141 aaataccttt gcctggatac ctctggctcc ttcagcagct agagctcagt atactaatgc
1201 cctatcttag tagagatttc atagctattt agagatattt tccattttaa gaaaacccga
1261 caacatttct gccaggtttg ttaggaggcc acatgatact tattcaaaaa aatcctagag
1321 attcttagct cttgggatgc aggctcagcc cgctggagca tgagctctgt gtgtaccgag
1381 aactggggtg atgttttact tttcacagta tgggctacac agcagctgtt caacaagagt
1441 aaatattgtc acaacactga acctctggct agaggacata ttcacagtga acataactgt
1501 aacatatatg aaaggcttct gggacttgaa atcaaatgtt tgggaatggt gcccttggag
1561 gcaacctccc attttagatg tttaaaggac cctatatgtg gcattccttt ctttaaacta
1621 taggtaatta aggcagctga aaagtaaatt gccttctaga cactgaaggc aaatctcctt
1681 tgtccattta cctggaaacc agaatgattt tgacatacag gagagctgca gttgtgaaag
1741 caccatcatc atagaggatg atgtaattaa aaaatggtca gtgtgcaaag aaaagaactg
1801 cttgcatttc tttatttctg tctcataatt gtcaaaaacc agaattaggt caagttcata
1861 gtttctgtaa ttggcttttg aatcaaagaa tagggagaca atctaaaaaa tatcttaggt
1921 tggagatgac agaaatatga ttgatttgaa gtggaaaaag aaattctgtt aatgttaatt
1981 aaagtaaaat tattccctga attgtttgat attgtcacct agcagatatg tattactttt
2041 ctgcaatgtt attattggct tgcactttgt gagtatctat gtaaaaatat atatgtatat
2101 aaaatatata ttgcatagga cagacttagg agttttgttt agagcagtta acatctgaag
2161 tgtctaatgc attaactttt gtaaggtact gaatacttaa tatgtgggaa acccttttgc
2221 gtggtcctta ggcttacaat gtgcactgaa tcgtttcatg taagaatcca aagtggacac
2281 cattaacagg tctttgaaat atgcatgtac tttatatttt ctatatttgt aactttgcat
2341 gttcttgttt tgttatataa aaaaattgta aatgtttaat atctgactga aattaaacga
2401 gcgaagatga gcacc
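Most of the web tools used below ignore the position numbers and spacing in the listing above, but some do not. If you would rather work with the bare sequence, the following minimal sketch shows one way to strip everything except the bases. The function name clean_sequence and the short test string are illustrative assumptions and are not part of the exercise.

```python
def clean_sequence(raw: str) -> str:
    """Keep only nucleotide letters; drop position numbers, spaces, and line breaks."""
    return "".join(ch for ch in raw.lower() if ch in "acgtn")


# Illustrative two-line fragment in the same numbered layout as the listing above.
raw = """
 1 cggcgccgcg agcttctcct ctcctcacga
31 ccgaggcaga gcagtcatta tggcgaacct
"""
seq = clean_sequence(raw)
print(len(seq))
print(seq)
```

Pasting the full listing into raw in place of the fragment should give the complete sequence as one unbroken string.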
Now that you have selected and copied the text, you will use tools maintained at the National Center for Biotechnology Information (NCBI) to analyze this sequence. The first program you will use is called ORF Finder, which searches for open reading frames (ORFs) in your sequence. To do this, go first to the NCBI Home Page at http://www.ncbi.nlm.nih.gov. For your convenience, use the right mouse button to open this link in a new window.
This page has links to many tools including search engines for sequences and structures, and PubMed, which allows access to 9 million citations and abstracts in MEDLINE.
Analysis of Open Reading Frames
When a DNA sequence is obtained, we generally want to know whether it codes for a protein. To find out, we look for start codons (ATG) on each strand of the DNA and then read the sequence 5' to 3', three bases at a time, to see how far the codons continue in frame before a stop codon (TAA, TAG, or TGA) is reached. An uninterrupted run of codons constitutes an ORF and is potentially a protein-encoding sequence. Connect to the ORF Finder program link toward the bottom of the "Hot Spots" list. To run the program, paste (Ctrl + V) the sequence into the large data input window labeled "sequence in FASTA format". Click on the "OrfFind" button above the window. Within a minute or two an ORF list will be shown. On the right will be the ORFs listed by length. On the left will be a map of the DNA sequence showing the ORFs (in blue) in each of the six possible reading frames.
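The search described above can be sketched in a few lines of Python. This is only an illustration of the idea under simple assumptions (an ORF runs from the first ATG in a frame to the next in-frame stop codon); it is not the algorithm ORF Finder itself uses, and the names find_orfs and min_codons are made up for this sketch.

```python
STOP_CODONS = {"taa", "tag", "tga"}


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, written 5' to 3'."""
    return seq.translate(str.maketrans("acgt", "tgca"))[::-1]


def find_orfs(seq: str, min_codons: int = 30):
    """List ORFs as (frame, start, end, codons), longest first.

    start and end are 1-based positions on the strand being read, so
    minus-strand positions count from the 5' end of the reverse complement.
    end includes the stop codon; codons does not count it.
    """
    seq = seq.lower()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            start = None
            for i in range(offset, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "atg":
                    start = i                      # first ATG opens an ORF
                elif start is not None and codon in STOP_CODONS:
                    codons = (i - start) // 3      # stop codon not counted
                    if codons >= min_codons:
                        orfs.append((f"{strand}{offset + 1}", start + 1, i + 3, codons))
                    start = None                   # look for the next ATG
    return sorted(orfs, key=lambda orf: -orf[3])


# Toy sequence: one short ORF (ATG + 5 codons + TAA) starting at base 3.
toy = "cc" + "atg" + "gct" * 5 + "taa" + "gg"
print(find_orfs(toy, min_codons=1))
```

The frame labels produced here (+1 to +3, -1 to -3) are one common convention and may not match the numbering shown by the web tool.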
Click on the left-most blue ORF (reading frame 2) and examine the sequence.
Q1. How long is the protein? At which base pair (number in the DNA sequence) does the ORF start? At which base pair (number in the DNA sequence) does the ORF end? What stop codon is used? What are the first and last amino acids in the sequence?
Now click on the ORF farthest to the right in reading frame 4 and examine the sequence.
Q2. How long is the protein? At which base pair (number in the DNA sequence) does the ORF start? At which base pair (number in the DNA sequence) does the ORF end? What stop codon is used? What are the first and last amino acids in the sequence?
Q3. Why are there six possible reading frames? What do you think the +1, +3, -1, etc. refer to in the list of ORFs?
Search for and align with similar sequences at GenBank
Go back to the NCBI Home Page. You will now search GenBank, a repository for DNA sequences, for this sequence. The program you will use is called BLAST (Basic Local Alignment Search Tool), which lists matching sequences registered at GenBank in order of decreasing similarity. Click on the BLAST button near the top of the page. To run the program, click Basic BLAST Search and paste (Ctrl + V) the sequence into the data input window. The default is a search of nucleotide sequences. Click on Submit Query. In a minute or two you will receive a long page of data. Above the list of sequences you will see a table summarizing the alignments. Below the table is a list of matching sequences. The higher the score, the better the match; the smaller the E value, the less likely it is that the match arose just by chance. Scroll down, and you will come to a base-by-base alignment of each of the sequence pairs.
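To see how the score and the E value are related, recall that for a normalized (bit) score S' the expected number of chance matches is approximately E = m × n × 2^(-S'), where m is the query length and n is the total length of the database. The sketch below simply evaluates that formula. The function name and the example lengths are illustrative assumptions, and BLAST itself uses adjusted "effective" lengths, so the values it reports will differ somewhat.

```python
def expected_chance_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E value: E = m * n * 2**(-S') for bit score S'."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical lengths, chosen only to show the trend.
for score in (30, 50, 100):
    e_value = expected_chance_hits(score, query_len=2_415, db_len=10**9)
    print(score, e_value)
```

Each additional bit of score halves the E value, which is why strong matches are reported with E values very close to zero.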
Q4. How many matches do you find?
The first listed sequence (ref|NM_000311.1|) is the sequence of interest. Click on the link to examine the GenBank entry.
In the top portion of the entry you will find a summary of information.
Q5. What is the name of the gene?
Q6. How many base pairs long is the reported sequence?
Q7. From what organism is the clone?
Q8. In what journal was it first published?
Q9. When was the most recent reference published?
As you scroll down the page you will see several links to MEDLINE. Click on the most recent.
Q10. What information is provided at this link?
At the MEDLINE entry, click on the Related Articles button on the top of the page.
Q11. How many articles are listed?
Q12. What is the common topic of the listed articles?
Click the back button to return to the GenBank entry and scroll down until you find the protein sequence.
Q13. How many amino acids are in the protein? (Click on the protein-ID link; click on the back button to return to the GenBank entry after determining the answer.) How does this compare with the results from ORF Finder?
Q14. What base numbers in the DNA sequence correspond to the codons of the protein (codons listed as CDS)?
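When comparing the CDS coordinates with the protein length, it helps to remember that a GenBank CDS range includes the stop codon. The following sketch does the arithmetic under that assumption; the function name protein_length and the coordinates in the example are hypothetical and are not taken from the entry you are examining.

```python
def protein_length(cds_start: int, cds_end: int) -> int:
    """Amino acids encoded by a CDS given as 1-based inclusive base numbers.

    Assumes the range ends with a stop codon, which is not translated.
    """
    n_bases = cds_end - cds_start + 1
    if n_bases % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    return n_bases // 3 - 1


# Hypothetical coordinates: 21 bases = 7 codons = 6 amino acids + stop.
print(protein_length(3, 23))
```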
Perform an on-line restriction enzyme site analysis
You will now use a program to determine the restriction sites present in your sequence. Go to the Webcutter site and paste the sequence into the dialog box. Use the default settings for analysis.
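A restriction site search of this kind amounts to scanning the sequence for each enzyme's recognition pattern. Here is a minimal sketch of that scan, assuming palindromic recognition sites so that only one strand needs to be searched. The function name find_sites is made up for illustration, and the two patterns shown (GAATTC for EcoRI and CCWWGG for StyI, where W means A or T) are well-known examples rather than the enzymes asked about below. This is not how Webcutter itself is implemented.

```python
import re

# IUPAC nucleotide codes expanded to regular-expression character classes.
IUPAC = {
    "a": "a", "c": "c", "g": "g", "t": "t",
    "r": "[ag]", "y": "[ct]", "w": "[at]", "s": "[cg]",
    "m": "[ac]", "k": "[gt]", "b": "[cgt]", "d": "[agt]",
    "h": "[act]", "v": "[acg]", "n": "[acgt]",
}


def find_sites(seq: str, site: str) -> list:
    """Return 1-based start positions of every occurrence of a recognition site.

    Only the given strand is scanned, which is sufficient for palindromic
    sites; a non-palindromic site would also need its reverse complement.
    """
    pattern = "".join(IUPAC[base] for base in site.lower())
    # The lookahead (?=...) lets overlapping occurrences all be reported.
    return [m.start() + 1 for m in re.finditer(f"(?={pattern})", seq.lower())]


toy = "ttgaattcaaccatggccttgg"
print(find_sites(toy, "GAATTC"))   # exact six-base site
print(find_sites(toy, "CCWWGG"))   # site with two ambiguous positions
```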
Q15. How many times does the enzyme Eco130I cut? Why do you suppose it cuts more frequently than Eco147I?
Q16. Enzymes commonly found in the multiple cloning sites of vectors are highlighted in color. Which of these do not cut the DNA? Why might this information be useful for cloning this piece of DNA?
You are now going to use some programs to analyze the protein.
Select and copy the protein sequence (avoid the quotation marks) and go to The CMS Molecular Biology Resource (http://www.sdsc.edu/ResTools/cmshp.html).
Once you are at the site, take a look around. Note that there are links to many sites for DNA and protein analysis, molecular modeling, software, phylogeny, journals, lab protocols, etc.
Determine from the amino acid sequence the probable subcellular location of the protein
Connect to the Sequence Search & Analysis Tools link listed first under Protein Analysis & Biochemistry. Click on Sequence Motifs/Patterns Recognition and Analysis, scroll down to Signal Peptide Motifs, and connect to PSORT in a new window. If that link does not work for you, open PSORT directly in a new window. Once at this page, link to the PSORT II Prediction. Paste the protein sequence into the data form and submit the data. A results page will load in a minute or two. The PSORT results summary is at the bottom of the page and lists the probabilities for particular locations based on analysis of the protein sequence. Note that these probabilities correspond to the likelihood of that prediction being correct, not the percentage of the protein in that location.
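One sequence feature that localization predictors commonly take into account is a stretch of hydrophobic residues, such as the core of a signal peptide or a membrane-spanning segment. The sketch below computes a simple sliding-window hydropathy profile using the Kyte-Doolittle scale. It is offered only as an illustration of that one feature and is not the PSORT II method; the function name, the window size, and the example peptide are assumptions.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein: str, window: int = 9) -> list:
    """Average hydropathy of each window of residues along the protein."""
    protein = protein.upper()
    return [
        sum(KYTE_DOOLITTLE[aa] for aa in protein[i:i + window]) / window
        for i in range(len(protein) - window + 1)
    ]


# Made-up peptide: a hydrophobic stretch followed by charged residues.
profile = hydropathy_profile("MALLVLLAVAKKDEEKRD", window=9)
peak = max(range(len(profile)), key=profile.__getitem__)
print(peak + 1, round(profile[peak], 2))  # 1-based start of the most hydrophobic window
```

Running the same function on the protein sequence from the GenBank entry shows where its most hydrophobic stretches lie, which can be compared with the PSORT output.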
Q17. Where in the cell would you expect the prion protein to be located?
Predict the secondary structure from the primary structure
Go back to the CMS page and link to Structure Prediction & Databases under Protein Analysis & Biochemistry. Under this heading you will find the heading 2D & 3D Structure Prediction Analyses. Scroll down to nnPredict and link to it. Paste the protein sequence into the data form and submit it.
Q18. What is the predominant secondary structure in the prion protein?