ANALYSIS OF DNA AND PROTEIN SEQUENCES
 

When a molecular biologist clones and sequences a DNA fragment, he or she will typically want to find out whether the sequence has been identified previously or whether related sequences are available.  In addition, the investigator will wish to determine characteristics of the protein encoded by the gene.

In this exercise, you will use the following DNA sequence as the basis for a number of database searches and analyses on the web.  To start, use the mouse to select the text of the sequence.  Once it is highlighted, press Ctrl + C to copy the text to the clipboard.  You will then be able to paste the sequence into various programs on the web.
 

        1 cggcgccgcg agcttctcct ctcctcacga ccgaggcaga gcagtcatta tggcgaacct
       61 tggctgctgg atgctggttc tctttgtggc cacatggagt gacctgggcc tctgcaagaa
      121 gcgcccgaag cctggaggat ggaacactgg gggcagccga tacccggggc agggcagccc
      181 tggaggcaac cgctacccac ctcagggcgg tggtggctgg gggcagcctc atggtggtgg
      241 ctgggggcag cctcatggtg gtggctgggg gcagccccat ggtggtggct ggggacagcc
      301 tcatggtggt ggctggggtc aaggaggtgg cacccacagt cagtggaaca agccgagtaa
      361 gccaaaaacc aacatgaagc acatggctgg tgctgcagca gctggggcag tggtgggggg
      421 ccttggcggc tacatgctgg gaagtgccat gagcaggccc atcatacatt tcggcagtga
      481 ctatgaggac cgttactatc gtgaaaacat gcaccgttac cccaaccaag tgtactacag
      541 gcccatggat gagtacagca accagaacaa ctttgtgcac gactgcgtca atatcacaat
      601 caagcagcac acggtcacca caaccaccaa gggggagaac ttcaccgaga ccgacgttaa
      661 gatgatggag cgcgtggttg agcagatgtg tatcacccag tacgagaggg aatctcaggc
      721 ctattaccag agaggatcga gcatggtcct cttctcctct ccacctgtga tcctcctgat
      781 ctctttcctc atcttcctga tagtgggatg aggaaggtct tcctgttttc accatctttc
      841 taatcttttt ccagcttgag ggaggcggta tccacctgca gcccttttag tggtggtgtc
      901 tcactctttc ttctctcttt gtcccggata ggctaatcaa tacccttggc actgatgggc
      961 actggaaaac atagagtaga cctgagatgc tggtcaagcc ccctttgatt gagttcatca
     1021 tgagccgttg ctaatgccag gccagtaaaa gtataacagc aaataaccat tggttaatct
     1081 ggacttattt ttggacttag tgcaacaggt tgaggctaaa acaaatctca gaacagtctg
     1141 aaataccttt gcctggatac ctctggctcc ttcagcagct agagctcagt atactaatgc
     1201 cctatcttag tagagatttc atagctattt agagatattt tccattttaa gaaaacccga
     1261 caacatttct gccaggtttg ttaggaggcc acatgatact tattcaaaaa aatcctagag
     1321 attcttagct cttgggatgc aggctcagcc cgctggagca tgagctctgt gtgtaccgag
     1381 aactggggtg atgttttact tttcacagta tgggctacac agcagctgtt caacaagagt
     1441 aaatattgtc acaacactga acctctggct agaggacata ttcacagtga acataactgt
     1501 aacatatatg aaaggcttct gggacttgaa atcaaatgtt tgggaatggt gcccttggag
     1561 gcaacctccc attttagatg tttaaaggac cctatatgtg gcattccttt ctttaaacta
     1621 taggtaatta aggcagctga aaagtaaatt gccttctaga cactgaaggc aaatctcctt
     1681 tgtccattta cctggaaacc agaatgattt tgacatacag gagagctgca gttgtgaaag
     1741 caccatcatc atagaggatg atgtaattaa aaaatggtca gtgtgcaaag aaaagaactg
     1801 cttgcatttc tttatttctg tctcataatt gtcaaaaacc agaattaggt caagttcata
     1861 gtttctgtaa ttggcttttg aatcaaagaa tagggagaca atctaaaaaa tatcttaggt
     1921 tggagatgac agaaatatga ttgatttgaa gtggaaaaag aaattctgtt aatgttaatt
     1981 aaagtaaaat tattccctga attgtttgat attgtcacct agcagatatg tattactttt
     2041 ctgcaatgtt attattggct tgcactttgt gagtatctat gtaaaaatat atatgtatat
     2101 aaaatatata ttgcatagga cagacttagg agttttgttt agagcagtta acatctgaag
     2161 tgtctaatgc attaactttt gtaaggtact gaatacttaa tatgtgggaa acccttttgc
     2221 gtggtcctta ggcttacaat gtgcactgaa tcgtttcatg taagaatcca aagtggacac
     2281 cattaacagg tctttgaaat atgcatgtac tttatatttt ctatatttgt aactttgcat
     2341 gttcttgttt tgttatataa aaaaattgta aatgtttaat atctgactga aattaaacga
     2401 gcgaagatga gcacc
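
The sequence above is laid out GenBank style, with a position number at the start of each line and a space after every ten bases.  Many web tools accept this as is, but some expect plain FASTA text.  The following is a minimal Python sketch of one way to convert it; the function name genbank_to_fasta and the header text "query" are hypothetical and are not part of any NCBI tool.

```python
import re

def genbank_to_fasta(raw, header="query", width=60):
    """Strip position numbers and whitespace, then return FASTA-formatted text.

    'header' and 'width' are illustrative defaults, not requirements of any tool.
    """
    # Keep letters only: this drops the line numbers, spaces and line breaks.
    bases = re.sub(r"[^A-Za-z]", "", raw)
    # Re-wrap the bare sequence at a fixed width.
    lines = [bases[i:i + width] for i in range(0, len(bases), width)]
    return ">" + header + "\n" + "\n".join(lines)

# The first two lines of the sequence above, used as sample input.
sample = """        1 cggcgccgcg agcttctcct ctcctcacga ccgaggcaga gcagtcatta tggcgaacct
       61 tggctgctgg atgctggttc tctttgtggc cacatggagt gacctgggcc tctgcaagaa"""

print(genbank_to_fasta(sample))
```

Pasting the whole sequence into the sample string instead of the first two lines gives a FASTA record for the full clone.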
 
 

Now that you have selected and copied the text, you will use tools maintained at the National Center for Biotechnology Information (NCBI) to analyze this sequence.  The first program you will use is called ORF Finder, which searches for open reading frames (ORFs) in your sequence.  To do this, first go to the NCBI Home Page at http://www.ncbi.nlm.nih.gov.  For your convenience, use the right mouse button to open this link in a new window.

This page has links to many tools including search engines for sequences and structures, and PubMed, which allows access to 9 million citations and abstracts in MEDLINE.

Analysis of Open Reading Frames

When a DNA sequence is obtained, we generally want to know if the sequence codes for a protein.  To do this, we look for start codons (ATG) on each strand of the DNA, and then read the sequence 5' to 3' to see how far the codons continue in frame before a stop codon is reached.  If they continue, they constitute an ORF, and potentially a protein-encoding sequence.  Connect to the ORF Finder program link toward the bottom of the "Hot Spots" list.  To run the program, paste (Ctrl + V) the sequence into the large data input window that is labeled "sequence in FASTA format".  Click on the "OrfFind" button above the window.  Within a minute or two an ORF list will be shown.  On the right will be the ORFs listed by length.  On the left will be a map of the DNA sequence showing the ORFs (in blue) in each of the 6 possible reading frames.
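
The scan described above (find an ATG, then read codon by codon until a stop codon) can be sketched in a few lines of Python.  This is only an illustration under simple assumptions: ATG is the only start codon, the three standard stop codons are used, and an ORF that runs off the end of the sequence is not reported.  ORF Finder itself may apply different rules, and the function names here are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the opposite strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def orfs_in_strand(seq, min_codons=1):
    """Yield (offset, start, end) for each ATG...stop run in one strand.

    start is a 0-based index, end is exclusive and includes the stop codon.
    """
    for offset in range(3):
        start = None
        for i in range(offset, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens the ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    yield offset, start, i + 3
                start = None                   # look for the next ORF

def find_orfs(dna, min_codons=1):
    """Return (strand, offset, first_base, last_base) using 1-based
    positions on the sequence as it was entered."""
    seq = dna.upper()
    n = len(seq)
    found = []
    for offset, s, e in orfs_in_strand(seq, min_codons):
        found.append(("forward", offset, s + 1, e))
    for offset, s, e in orfs_in_strand(reverse_complement(seq), min_codons):
        # Index j on the reverse strand is base n - j on the entered strand.
        found.append(("reverse", offset, n - s, n - e + 1))
    return found

print(find_orfs("CCATGAAATAGCC"))
```

Raising min_codons (for example to 100) filters out the many short ORFs that occur by chance in any long sequence.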

Click on the left-most blue ORF (reading frame 2) and examine the sequence.

Q1.  How long is the protein?  At which base pair (number in the DNA sequence) does the ORF start?   At which base pair (number in the DNA sequence) does the ORF end?  What stop codon is used?  What are the first and last amino acids in the sequence?

Now click on the ORF farthest to the right in reading frame 4 and examine the sequence.

Q2.  How long is the protein?  At which base pair (number in the DNA sequence) does the ORF start?   At which base pair (number in the DNA sequence) does the ORF end?  What stop codon is used?  What are the first and last amino acids in the sequence?
 

Q3.  Why are there six possible reading frames?  What do you think the +1, +3, -1, etc. refer to in the list of ORFs?
 

Search for and align with similar sequences at GenBank

Go back to the NCBI Home Page.  You will now search GenBank, a repository for DNA sequences, for this sequence.  The program you will use is called BLAST (Basic Local Alignment Search Tool), which lists sequences registered at GenBank in order of decreasing similarity to your query.  Click on the BLAST button near the top of the page.  To run the program, click Basic BLAST Search and paste (Ctrl + V) the sequence into the data input window.  The default is a search of nucleotide sequences.  Click on Submit Query.  In a minute or two a long page of results will load.  At the top of the list of sequences you will see a table summarizing the alignments.  Below the table is a list of matching sequences.  The higher the score, the better the match; the smaller the E value, the less likely it is that the match arose just by chance.  Scroll down, and you will come to a base-by-base alignment of each of the sequence pairs.
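
The link between score and E value can be illustrated with the standard bit-score relation E = m x n x 2^(-S'), where S' is the bit score, m is the query length and n is the total length of the database.  The Python sketch below is a simplified illustration: the function name and the database size are hypothetical, and BLAST itself uses corrected "effective" lengths, so its reported E values will differ somewhat.

```python
def expected_hits(bit_score, query_len, db_len):
    """Approximate E value: number of alignments with at least this bit
    score expected by chance in a search space of query_len x db_len."""
    return query_len * db_len * 2.0 ** (-bit_score)

# Hypothetical search: a 2415-base query against a database of 1e9 bases.
for score in (30, 50, 100):
    print(score, expected_hits(score, 2415, 1e9))
```

Each additional bit of score halves the expected number of chance matches, which is why high-scoring alignments have very small E values.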

Q4.  How many matches do you find?

The first listed sequence (ref|NM_000311.1|) is the sequence of interest.  Click on the link to examine the GenBank entry.

In the top portion of the entry you will find a summary of information.

Q5.  What is the name of the gene?
Q6.  How many base pairs long is the reported sequence?
Q7.  From what organism is the clone?
Q8.  In what journal was it first published?
Q9.  When was the most recent reference published?

As you scroll down the page you will see several links to MEDLINE.  Click on the most recent.

Q10.  What information is provided at this link?

At the MEDLINE entry, click on the Related Articles button on the top of the page.

Q11.  How many articles are listed?

Q12.  What is the common topic of the listed articles?

Click the back button to return to the GenBank entry and scroll down until you find the protein sequence.

Q13. How many amino acids are in the protein? (Click on the protein-ID link; click on the back button to return to the GenBank entry after determining the answer.)  How does this compare with the results from ORF Finder?

Q14.  What base numbers in the DNA sequence correspond to the codons of the protein (codons listed as CDS)?
 

Perform an on-line restriction enzyme site analysis
 

You will now use a program to determine the restriction sites present in your sequence.  Go to the Webcutter site and paste the sequence into the dialog box.  Use the default settings for analysis.
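
At its core, a restriction site search is a scan of the sequence for each enzyme's recognition sequence.  The Python sketch below shows the idea for a single enzyme, using the well-known EcoRI site GAATTC as the example.  It is not how Webcutter is implemented: the function name is hypothetical, it handles only exact (non-degenerate) sites, and it reports where each site begins rather than the exact cut position.

```python
def find_sites(dna, site):
    """Return the 1-based start position of every occurrence of 'site'.

    Overlapping occurrences are included.  For a palindromic site such as
    GAATTC, scanning one strand is enough because the opposite strand reads
    the same; a non-palindromic site would also need a scan of the reverse
    complement.
    """
    seq, site = dna.upper(), site.upper()
    positions = []
    i = seq.find(site)
    while i != -1:
        positions.append(i + 1)
        i = seq.find(site, i + 1)   # step one base so overlaps are found
    return positions

# Toy sequence containing two EcoRI sites.
print(find_sites("ttgaattcaagaattc", "GAATTC"))
```

The same function can be run on the sequence from this exercise with any exact recognition sequence to check a result from the web tool by hand.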

Q15.  How many times does the enzyme Eco130I cut?  Why do you suppose it cuts more frequently than Eco147I?
 

Q16.  Enzymes commonly found in the multiple cloning sites of vectors are highlighted in color.  Which of these do not cut the DNA?  Why might this information be useful for cloning this piece of DNA?
 
 
 

You are now going to use some programs to analyze the protein.  Select and copy the protein sequence from the GenBank entry (avoid the quotation marks) and go to The CMS Molecular Biology Resource (http://www.sdsc.edu/ResTools/cmshp.html).
 

Once you are at the site, take a look around.  Note that there are links to many resources for DNA and protein analysis, molecular modeling, software, phylogeny, journals, lab protocols, etc.
 

Determine from the amino acid sequence the probable subcellular location of the protein

Connect to the Sequence Search & Analysis Tools link listed first under Protein Analysis & Biochemistry.  Click on Sequence Motifs/Patterns Recognition and Analysis, scroll down to Signal Peptide Motifs, and connect to PSORT in a new window.  If this link does not work for you, open this link (PSORT) in a new window.  Once at this page, link to the PSORT II Prediction.  Paste the protein sequence into the data form and submit the data.  A results page will load in a minute or two.  The PSORT results summary is at the bottom of the page and lists the probabilities for particular locations based on analysis of the protein sequence.  Note that these probabilities correspond to the likelihood of that prediction being correct, not the percentage of the protein in that location.

Q17.  Where in the cell would you expect the prion protein to be located?

Predict the secondary structure from the primary structure

Go back to the CMS page and link to Structure Prediction & Databases under Protein Analysis & Biochemistry.  Under this heading you will find the heading 2D & 3D Structure Prediction Analyses.  Scroll down to nnPredict and link to it.  Paste the protein sequence into the data form and submit it.

Q18.  What is the predominant secondary structure in the prion protein?