Translation "http://bioruby.g.hatena.ne.jp/nakao_mitsuteru/20071010/II"
By Nakao Mitsuteru (Lecture Note given on 2007-10-10)
Programming Basics II: Repeats, Code, Condition, Regular Expression
In this chapter, I will introduce Repeats, Code, Condition, and Regular Expression.
These are widely used not only in bioinformatics but in programming in general. Regular Expression in particular is essential for treating a DNA sequence as a string.
In addition, I will introduce several concepts that are specific to programming.
Repeats
Repeating an operation is very easy in the Ruby programming language.
Repeat 100 times
REPEAT UNIT
puts "Hello !"
RESULT of EXECUTION
bioruby> puts "Hello !"
Hello !
==> nil
Do the following to repeat this 100 times:
bioruby> 100.times { puts "Hello !" }
RESULT of EXECUTION
bioruby> 100.times { puts "Hello !" }
Hello !
Hello !
Hello !
Hello !
Hello !
.........
Hello !
Hello !
Hello !
Hello !
Hello !
==> 100
A segment of a program surrounded by { and } is called a BLOCK.
100.times is followed by a BLOCK and executes it 100 times.
BLOCKs are used very often in Ruby.
They are one of the interesting features of Ruby.
If you are interested in this feature, please read the section on BLOCKs in a Ruby book.
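As a side note, Ruby also accepts 'do' and 'end' in place of { and } around a BLOCK. The short sketch below uses 3 repeats instead of 100 to keep the output small, and shows both forms. They behave the same here.

```ruby
# A BLOCK written with { and }.
3.times { puts "Hello !" }

# The same BLOCK written with do and end.
3.times do
  puts "Hello !"
end
```

Each form displays "Hello !" three times.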
Repeat 100 times with numbers
repeat 100 times with numbers
100.times {|i|
puts "Hello ! " + i.to_s
}
RESULT of EXECUTION
bioruby> 100.times {|i|
bioruby+ puts "Hello ! " + i.to_s
}
Hello ! 0
Hello ! 1
Hello ! 2
Hello ! 3
Hello ! 4
..............
Hello ! 95
Hello ! 96
Hello ! 97
Hello ! 98
Hello ! 99
==> 100
100.times counts the repeat cycles
and places the number of the current cycle in the variable i between | and | each time.
The cycle number starts from 0, so here it runs from 0 to 99.
This feature is very convenient when you want to repeat something a fixed number of times.
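The following is a minimal sketch of how the cycle number behaves. The helper name 'counters' is hypothetical and only collects the values that 'times' places in i. The last line uses '1.upto', a standard Ruby method, as one possible way to count from 1 instead of 0.

```ruby
# Hypothetical helper: collect the cycle numbers that 'times' passes to its BLOCK.
def counters(n)
  seen = []
  n.times { |i| seen << i }   # i runs from 0 to n - 1
  seen
end

p counters(5)   # displays [0, 1, 2, 3, 4]

# 1.upto(3) passes 1, 2, 3 to the BLOCK instead of 0, 1, 2.
1.upto(3) { |i| puts "Hello ! " + i.to_s }
```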
Display elements of Array one by one
Use 'each' when you want to evaluate the elements of a collection one by one.
Use 'each'
array = [0, 1, 2, 3, 4]
array.each {|x|
puts x
}
RESULT of EXECUTION
bioruby> array = [0, 1, 2, 3, 4]
==> [0, 1, 2, 3, 4]
bioruby> array.each {|x|
bioruby+ puts x
}
0
1
2
3
4
==> [0, 1, 2, 3, 4]
As seen above, when 'each' is used, the elements are evaluated one by one.
Display the square of each element
array.each {|x|
puts x ** 2
}
RESULT of EXECUTION
bioruby> array.each {|x|
bioruby+ puts x ** 2
}
0
1
4
9
16
==> [0, 1, 2, 3, 4]
You can also get the results back as a new Array instead of only displaying them.
To do this, use 'map'.
Use 'map'
array.map {|x|
x ** 2
}
RESULT of EXECUTION
bioruby> array.map {|x|
bioruby+ x ** 2
}
==> [0, 1, 4, 9, 16]
Here 'map' is used instead of 'each'. The name 'map' comes from mapping: each element is mapped to a new value.
Store the result of 'map' in array2
array
array2 = array.map {|x|
x ** 2
}
array2
RESULT of EXECUTION
bioruby> array
==> [0, 1, 2, 3, 4]
bioruby> array2 = array.map {|x|
bioruby+ x ** 2
}
==> [0, 1, 4, 9, 16]
bioruby> array2
==> [0, 1, 4, 9, 16]
As shown above, 'map' applies the same operation to every element of an Array in a very convenient way.
It transforms a series of data into a new Array and leaves the original Array unchanged.
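To make the difference between 'each' and 'map' concrete, here is a small sketch with sequences instead of numbers. The helper name 'sequence_lengths' and the sample sequences are assumptions made for illustration.

```ruby
# Hypothetical helper: return the length of every sequence in an Array.
def sequence_lengths(sequences)
  sequences.map { |seq| seq.length }   # 'map' collects the value of each BLOCK run
end

seqs = ["ACGT", "ACGUAC", "AT"]
p sequence_lengths(seqs)               # displays [4, 6, 2]

# 'each' runs the same BLOCK, but its value is the original Array.
p seqs.each { |seq| seq.length }       # displays ["ACGT", "ACGUAC", "AT"]
```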
'each' with index
Use 'each_with_index' when you also want to use the index of each element of the Array
array.each_with_index {|x, i|
puts [i.to_s, x.to_s].join(" ")
}
RESULT of EXECUTION
bioruby> array.each_with_index {|x, i|
bioruby+ puts [i.to_s, x.to_s].join(" ")
}
0 0
1 1
2 2
3 3
4 4
==> [0, 1, 2, 3, 4]
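Because the index starts from 0, a common adjustment is to add 1 when positions should be counted from 1, as is usual for sequence positions. The sketch below assumes a hypothetical helper named 'numbered'.

```ruby
# Hypothetical helper: label each element with its position counted from 1.
def numbered(array)
  lines = []
  array.each_with_index { |x, i|
    lines << [(i + 1).to_s, x.to_s].join(" ")   # the index i itself starts from 0
  }
  lines
end

puts numbered(["A", "C", "G", "T"])
# displays:
# 1 A
# 2 C
# 3 G
# 4 T
```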
Evaluate Elements of Hash One by One
Use 'each'
hash = {"DNA" => "ACGT", "RNA" => "ACGU"}
hash.each {|key, value|
puts [key, value].join(": ")
}
RESULT of EXECUTION
bioruby> hash = {"DNA" => "ACGT", "RNA" => "ACGU"}
==> {"DNA"=>"ACGT", "RNA"=>"ACGU"}
bioruby> hash.each {|key, value|
bioruby+ puts [key, value].join(": ")
}
DNA: ACGT
RNA: ACGU
==> {"DNA"=>"ACGT", "RNA"=>"ACGU"}
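'map' can be used with a Hash in the same way as 'each', with the key and the value placed between | and |. The following sketch assumes a hypothetical helper named 'describe_alphabets'. The result of 'map' on a Hash is an Array.

```ruby
# Hypothetical helper: build one description line per key and value of a Hash.
def describe_alphabets(hash)
  hash.map { |key, value|
    key + " has " + value.length.to_s + " letters: " + value
  }
end

puts describe_alphabets({"DNA" => "ACGT", "RNA" => "ACGU"})
# displays:
# DNA has 4 letters: ACGT
# RNA has 4 letters: ACGU
```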
Summary of Repeats
1. For repeats, 'each' plays the central role.
2. 'each' can be used for both Array and Hash.
3. To apply an operation to all elements of a set (mapping), use 'map'.
Code
Code means a unit of a program. A program is composed of Codes.
Everything we have copied and pasted during this practice is Code.
Basically, a Code returns a value. So far, what was displayed after ==> in the result of execution at the bioruby> prompt
is the value of the Code.
Code "ACGT"
"ACGT"
RESULT of EXECUTION
bioruby> "ACGT"
==> "ACGT"
The value of the Code "ACGT" is "ACGT".
Code dna = "ACGT"
dna = "ACGT"
RESULT of EXECUTION
bioruby> dna = "ACGT"
==> "ACGT"
Code 'puts dna'
puts dna
RESULT of EXECUTION
bioruby> puts dna
ACGT
==> nil
The value of 'puts dna' is 'nil'. 'nil' is a special value that indicates 'empty'.
Summary of Code
1. What we have copied and pasted so far are 'Codes'.
2. 'nil' is the value that means 'empty'.
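Since every Code has a value, that value can be stored in a variable and inspected. The sketch below uses 'p' to display values and 'nil?' to check for nil. The variable names are chosen only for illustration.

```ruby
# The value of a Code can be stored in a variable and displayed with 'p'.
value_of_string = "ACGT"          # the value of "ACGT" is "ACGT"
value_of_puts = puts("ACGT")      # displays ACGT; the value of 'puts' is nil

p value_of_string                 # displays "ACGT"
p value_of_puts                   # displays nil
p value_of_puts.nil?              # displays true
```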
Condition
A Condition is a structure that controls the flow of a program.
Here I introduce Boolean judgement (true/false) and the control mechanism of branching.
Judgement of true or false
Conditional branching relies on a true or false judgement. Such a true or false result is one kind of value of a Code.
Ruby uses 'true' and 'false' as these values.
Less-than comparison
1 < 2
RESULT of EXECUTION
bioruby> 1 < 2
==> true
Greater-than comparison
1 > 2
RESULT of EXECUTION
bioruby> 1 > 2
==> false
Less-than-or-equal comparison
1 <= 1
RESULT of EXECUTION
bioruby> 1 <= 1
==> true
Greater-than-or-equal comparison
1 >= 2
RESULT of EXECUTION
bioruby> 1 >= 2
==> false
Equality comparison
1 == 1
RESULT of EXECUTION
bioruby> 1 == 1
==> true
Inequality comparison
1 != 2
RESULT of EXECUTION
bioruby> 1 != 2
==> true
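The same comparisons also work on values other than plain numbers, for example on the length of a sequence or on strings themselves. This is a small sketch. The helper name 'longer_than?' is hypothetical, and 'upcase' is a standard String method that returns an upper-case copy.

```ruby
# Hypothetical helper: is the sequence longer than a given minimum length?
def longer_than?(sequence, minimum)
  sequence.length > minimum
end

p longer_than?("ACGT", 3)       # displays true
p longer_than?("ACGT", 4)       # displays false
p("ACGT" == "acgt")             # displays false: upper and lower case differ
p("ACGT" == "acgt".upcase)      # displays true
p("ACGT" != "ACGU")             # displays true
```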
If
'if' is the basis of conditional branching. When you compare the same value against many branching conditions,
use 'case'.
Use 'if' for conditional branching. In the next example, the program evaluates the characters of a nucleotide sequence string one by one and, when it detects "A", displays "Adenine.".
dna = "ACGT"
dna.split("").each {|nucleotide|
if nucleotide == "A" then
puts "Adenine."
end
}
RESULT of EXECUTION
bioruby> dna.split("").each {|nucleotide|
bioruby+ if nucleotide == "A" then
puts "Adenine."
end
}
Adenine.
==> ["A", "C", "G", "T"]
This program evaluates the condition (if nucleotide == "A"). If the nucleotide is "A", it displays "Adenine."; otherwise it does nothing.
Let's take a look at the condition when the nucleotide is "A".
nucleotide = "A"
nucleotide == "A"
RESULT of EXECUTION
bioruby> nucleotide = "A"
==> "A"
bioruby> nucleotide == "A"
==> true
Here '==' returns the Boolean value true.
Let's see what happens when it is "T".
nucleotide = "T"
nucleotide == "A"
RESULT of EXECUTION
bioruby> nucleotide = "T"
==> "T"
bioruby> nucleotide == "A"
==> false
Here '==' returns the Boolean value false.
As you saw here, the condition of 'if' is evaluated to a value of true or false.
If it is true, 'if' executes the code after 'then' (in this case, 'puts "Adenine."').
If-else
Sometimes we want to do something when all conditions are false.
For this we use 'if-then-else'. If all conditions are false, the code following 'else' is executed.
Here, if the nucleotide is "A", 'puts "Adenine."' is executed; otherwise 'puts "Others."' is executed.
dna = "ACGT"
dna.split("").each {|nucleotide|
if nucleotide == "A" then
puts "Adenine."
else
puts "Others."
end
}
RESULT of EXECUTION
bioruby> dna = "ACGT"
==> "ACGT"
bioruby> dna.split("").each {|nucleotide|
bioruby+ if nucleotide == "A" then
puts "Adenine."
else
bioruby+ puts "Others."
end
}
Adenine.
Others.
Others.
Others.
==> ["A", "C", "G", "T"]
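When more than two outcomes are needed, Ruby allows further conditions between 'if' and 'else' with 'elsif'. The sketch below is one possible way to name all four DNA nucleotides. The helper name 'nucleotide_name' and the extra test character "N" are assumptions made for illustration.

```ruby
# Hypothetical helper: return a name for one nucleotide character.
# The value of the whole 'if' is the value of the branch that was executed.
def nucleotide_name(nucleotide)
  if nucleotide == "A" then
    "Adenine."
  elsif nucleotide == "C" then
    "Cytosine."
  elsif nucleotide == "G" then
    "Guanine."
  elsif nucleotide == "T" then
    "Thymine."
  else
    "Others."
  end
end

"ACGTN".split("").each { |nucleotide|
  puts nucleotide_name(nucleotide)
}
# displays Adenine. Cytosine. Guanine. Thymine. Others. (one per line)
```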
case-when
'if-then-else' is easy while the number of values to check is small. As the number grows, however, 'if-then-else' becomes cluttered.
In such a case, use 'case-when' to sort out the many conditions.
dna = "ACGT"
dna.split("").each { |nucleotide|
case nucleotide
when "A","G"
puts "Purine."
when "C","T","U"
puts "Pyrimidine."
end
}
RESULT of EXECUTION
bioruby> dna = "ACGT"
==> "ACGT"
bioruby> dna.split("").each { |nucleotide|
bioruby+ case nucleotide
when "A","G"
puts "Purine."
when "C","T","U"
puts "Pyrimidine."
end
}
Purine.
Pyrimidine.
Purine.
Pyrimidine.
==> ["A", "C", "G", "T"]
You can use 'else' with 'case-when', just as with 'if'.
What follows 'when' is not limited to plain values like these.
You can also use a regular expression there, which will be described later.
I will come back to this point later.
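As a minimal sketch of 'else' in 'case-when', the example above can be extended so that unexpected characters are also handled. The helper name 'base_class', the label "Unknown." and the test character "N" are assumptions made for illustration.

```ruby
# Hypothetical helper: classify one nucleotide character.
def base_class(nucleotide)
  case nucleotide
  when "A", "G"
    "Purine."
  when "C", "T", "U"
    "Pyrimidine."
  else                 # executed when no 'when' matched
    "Unknown."
  end
end

"ACGTN".split("").each { |nucleotide|
  puts nucleotide + ": " + base_class(nucleotide)
}
# displays:
# A: Purine.
# C: Pyrimidine.
# G: Purine.
# T: Pyrimidine.
# N: Unknown.
```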
Summary of Conditional Branching
1. Boolean values are the special values 'true' and 'false'.
2. A comparison returns a Boolean value.
3. 'case-when' may be easier to use when the branching gets a little complex.
Regular Expression
Regular Expressions are very important for handling strings.
A Regular Expression describes a set of strings.
For example, all lower-case letters of the alphabet can be expressed by [a-z].
Things You Can Do If You Know Regular Expression
1. Search motifs in sequences
2. Extract information necessary from strings
Match one or more DNA characters (upper and lower case)
/[ACGTacgt]+/
Example of match
"acgt" =~ /[ACGTacgt]+/
RESULT of EXECUTION
bioruby> "acgt" =~ /[ACGTacgt]+/
==> 0
If the string matches, the position of the match is returned (here 0, the beginning of the string).
[ACGTacgt] matches any one of the characters inside [], and + means one or more. Therefore, /[ACGTacgt]+/ is a regular expression that matches a string of one or more of the characters inside []. A Regular Expression is written between / and /.
...................back later
Example of No Match
"bbbb" =~ /[ACGTacgt]+/
RESULT of EXECUTION
bioruby> "bbbb" =~ /[ACGTacgt]+/
==> nil
If there is no match, 'nil' is returned.
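The value of '=~' can be used directly as the condition of 'if'. In Ruby only 'nil' and 'false' count as false, so a match at position 0 still counts as true. The sketch below assumes a hypothetical helper named 'nucleotides_found?'.

```ruby
# Hypothetical helper: does the text contain at least one DNA character?
def nucleotides_found?(text)
  if text =~ /[ACGTacgt]+/ then   # any position, including 0, counts as true
    true
  else                            # nil (no match) counts as false
    false
  end
end

p nucleotides_found?("acgt")      # displays true
p nucleotides_found?("bbbb")      # displays false
p("bbacgt" =~ /[ACGTacgt]+/)      # displays 2: the match starts at the third character
```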
Characters
1. A word character (a letter, a digit, or an underscore) is '\w'. A non-word character is '\W'.
2. A digit is '\d'. A non-digit is '\D'.
3. A whitespace character (space, tab, newline, return) is '\s'. Anything else is '\S'.
Any single character (except a newline) is '.'.
Length
1. One or more is '+'.
2. Zero or more is '*'.
3. An exact length is '{length}'.
4. A range of lengths is '{minimum length,maximum length}'.
Further, symbols for position
1. Beginning of a line is '^'.
2. End of a line is '$'.
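The symbols above can be tried out one by one. The following is a small sketch in which the helper name 'match_position' and the sample strings are assumptions. Each line shows the position returned by '=~', or nil when there is no match.

```ruby
# Hypothetical helper: return the match position, or nil if there is no match.
def match_position(text, pattern)
  text =~ pattern
end

p match_position("chr12", /\d+/)            # displays 3: the digits start at position 3
p match_position("AC GT", /\s/)             # displays 2: the space is at position 2
p match_position("ACGT", /^.C/)             # displays 0: any character followed by C at the beginning
p match_position("ACGT", /^[ACGT]{4}$/)     # displays 0: exactly 4 characters
p match_position("ACGTA", /^[ACGT]{4}$/)    # displays nil: 5 characters is too long
p match_position("ACG", /^[ACGT]{2,3}$/)    # displays 0: between 2 and 3 characters
```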
Regular Expression for STOP CODONS by IUPAC codes
stop_codon = /U(A[AG]|GA)/
"UAA" =~ stop_codon
"UAG" =~ stop_codon
"UGA" =~ stop_codon
"AUG" =~ stop_codon
RESULT of EXECUTION
bioruby> stop_codon = /U(A[AG]|GA)/
==> /U(A[AG]|GA)/
bioruby> "UAA" =~ stop_codon
==> 0
bioruby> "UAG" =~ stop_codon
==> 0
bioruby> "UGA" =~ stop_codon
==> 0
bioruby> "AUG" =~ stop_codon
==> nil
Now let me explain /U(A[AG]|GA)/.
1. There are three STOP CODONS: "UAA", "UAG", and "UGA".
2. A Regular Expression that expresses all three STOP CODONS is /(UAA|UAG|UGA)/.
Here '|' means 'or', and ( ) groups the alternatives (grouping of strings).
3. Next, all STOP CODONS start with 'U'; therefore, you can express them as /U(AA|AG|GA)/.
4. Further, AA and AG can be combined as A[AG], which gives /U(A[AG]|GA)/.
As shown in these examples, a Regular Expression can be written in many different ways to match the same strings.
Writing the shortest possible Regular Expression (Regex) may take the skill of a maestro.
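One way to convince yourself that the three forms are equivalent is to try them on all 64 possible codons. The sketch below is an illustration only. The helper names 'all_codons' and 'matching_codons' are hypothetical, 'product' and 'select' are standard Array methods not covered in this chapter, and ^ and $ are added so that each pattern must match the whole codon.

```ruby
# Hypothetical helper: build all 64 RNA codons ("AAA", "AAC", ..., "UUU").
def all_codons
  bases = ["A", "C", "G", "U"]
  bases.product(bases, bases).map { |triple| triple.join }
end

# Hypothetical helper: keep only the codons that match the pattern.
def matching_codons(pattern)
  all_codons.select { |codon| codon =~ pattern }
end

patterns = [/^(UAA|UAG|UGA)$/, /^U(AA|AG|GA)$/, /^U(A[AG]|GA)$/]
patterns.each { |pattern|
  p matching_codons(pattern)   # displays ["UAA", "UAG", "UGA"] each time
}
```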
Regular Expressions are used to find strings that match patterns. Matching can be used for
1. substitution of strings (gsub, sub)
2. collecting the matching strings (scan)
3. true or false conditions (if, case-when, =~)
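As a brief sketch of the first two uses, the following converts a DNA string to RNA letters with 'gsub' and collects runs of T with 'scan'. The helper name 'transcribe' and the sample sequence are assumptions made for illustration.

```ruby
# Hypothetical helper: replace every T with U.
def transcribe(dna)
  dna.gsub("T", "U")         # 'gsub' substitutes all occurrences
end

dna = "ATGTTTTAA"
p transcribe(dna)            # displays "AUGUUUUAA"
p dna.sub("T", "U")          # displays "AUGTTTTAA": 'sub' substitutes only the first one
p dna.scan(/T+/)             # displays ["T", "TTTT"]: every run of one or more T
p dna                        # displays "ATGTTTTAA": the original string is unchanged
```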
For example, you can write a script to detect whether a sequence is DNA or RNA as follows:
case sequence
when /^[ACGT]+$/
puts "DNA"
when /^[ACGU]+$/
puts "RNA"
end
If the sequence is DNA
sequence = 'ACTG' * 100
case sequence
when /^[ACGT]+$/
puts "DNA"
when /^[ACGU]+$/
puts "RNA"
end
RESULT of EXECUTION
bioruby> sequence = 'ACTG' * 100
==> "ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTG"
bioruby> case sequence
when /^[ACGT]+$/
puts "DNA"
when /^[ACGU]+$/
puts "RNA"
end
DNA
==> nil
Here the sequence is DNA.
If the sequence is RNA
sequence = 'AUCG' * 100
case sequence
when /^[ACGT]+$/
puts "DNA"
when /^[ACGU]+$/
puts "RNA"
end
RESULT of EXECUTION
bioruby> sequence = 'AUCG' * 100
==> "AUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCG"
bioruby> case sequence
when /^[ACGT]+$/
puts "DNA"
when /^[ACGU]+$/
puts "RNA"
end
RNA
==> nil
Here the sequence is RNA.
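Two details of this script are worth trying out: a sequence that matches neither pattern displays nothing, and a sequence without T or U matches the first 'when'. The sketch below wraps the same 'case' in a hypothetical helper named 'sequence_type' and adds an 'else' branch with the assumed label "Unknown".

```ruby
# Hypothetical helper: the same 'case' as above, with an added 'else' branch.
def sequence_type(sequence)
  case sequence
  when /^[ACGT]+$/
    "DNA"
  when /^[ACGU]+$/
    "RNA"
  else
    "Unknown"
  end
end

p sequence_type("ACTG" * 3)   # displays "DNA"
p sequence_type("AUCG" * 3)   # displays "RNA"
p sequence_type("ACTGU")      # displays "Unknown": T and U are mixed
p sequence_type("ACG")        # displays "DNA": the first matching 'when' is used
```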
Capture Matched Strings
The strings matched by the groups in ( ) are stored in the match variables $1 through $9.
if position = "acgtttttgagtggtaccgttttat" =~ /([acgt]gt)/
p position
p $1
end
RESULT of EXECUTION
bioruby> if position = "acgtttttgagtggtaccgttttat" =~ /([acgt]gt)/
p position
p $1
end
1
"cgt"
==> nil
A string which matches /([acgt]gt)/ is found starting at the second character of the string (position 1): 'cgt'.
However, there are more matching strings.
To obtain all matching strings, use 'scan'
matches = "acgtttttgagtggtaccgttttat".scan(/([acgt]gt)/)
p matches
RESULT of EXECUTION
bioruby> matches = "acgtttttgagtggtaccgttttat".scan(/([acgt]gt)/)
==> [["cgt"], ["agt"], ["ggt"], ["cgt"]]
bioruby> p matches
[["cgt"], ["agt"], ["ggt"], ["cgt"]]
You can imagine capturing all STOP CODONS in a sequence by using 'scan'.
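A minimal sketch of that idea follows. The helper name 'stop_codon_hits' and the sample sequence are assumptions. The pattern uses (?: ), a non-capturing group, so that 'scan' returns the whole matched codon; with plain ( ) it would return only the captured part, such as [["AA"], ["GA"], ["AG"]]. Note that this simple scan ignores the reading frame.

```ruby
# Hypothetical helper: collect every STOP CODON string found in an mRNA.
def stop_codon_hits(mrna)
  mrna.scan(/U(?:A[AG]|GA)/)   # (?: ) groups without capturing
end

p stop_codon_hits("AUGGCUUAAGGUGACUAG")   # displays ["UAA", "UGA", "UAG"]
p stop_codon_hits("AUGGCC")               # displays []
```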
Summary of Regular Expression
1. A Regular Expression is a way to describe a set of strings.
2. Characters, lengths, and positions are expressed by different symbols.
3. You can capture the positions and strings that matched.