Monday, November 19, 2007

Nakao-Sensei's LectureNote 2007-10-10

Programming Basics II
Translation of "http://bioruby.g.hatena.ne.jp/nakao_mitsuteru/20071010/II"

By Nakao Mitsuteru (Lecture Note given on 2007-10-10)

Programming Basics II: Repeats, Code, Condition, Regular Expression

In this chapter, I will introduce repeats, code, conditions, and regular expressions.
These are widely used not only in bioinformatics but in programming in general. Regular expressions in particular are essential for treating a DNA sequence as a string.
In addition, I will introduce many concepts specific to programming here.

Repeats

Repeating things is very easy in the Ruby programming language.

Repeat 100 times

REPEAT UNIT

puts "Hello !"

RESULT of EXECUTION

bioruby> puts "Hello !"
Hello !
==> nil

To repeat this 100 times, do the following:

bioruby> 100.times { puts "Hello !" }

RESULT of EXECUTION

bioruby> 100.times { puts "Hello !" }
Hello !
Hello !
Hello !
Hello !
Hello !
.........
Hello !
Hello !
Hello !
Hello !
Hello !
==> 100

A segment of a program surrounded by { and } is called a BLOCK.
100.times is followed by a BLOCK and repeats it 100 times.
BLOCKs are used very often in Ruby.

This is an interesting feature of Ruby.
If you are interested in this feature, please read the section on BLOCKs in a Ruby book.
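
As a small sketch that goes slightly beyond the lecture: Ruby also accepts a BLOCK written with do and end instead of { and }. The variable name greetings is made up for this example.

```ruby
# Collect greetings with a { } block.
greetings = []
3.times { greetings << "Hello !" }

# The same kind of block written with do ... end.
3.times do
  greetings << "Hello !"
end

puts greetings.length   # prints 6
```

Both forms run the block once per cycle, so the Array ends up with six elements.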

Repeat 100 times with numbers

repeat 100 times with numbers

100.times {|i|
puts "Hello ! " + i.to_s
}

RESULT of EXECUTION

bioruby> 100.times {|i|
bioruby+ puts "Hello ! " + i.to_s
}
Hello ! 0
Hello ! 1
Hello ! 2
Hello ! 3
Hello ! 4
..............
Hello ! 95
Hello ! 96
Hello ! 97
Hello ! 98
Hello ! 99
==> 100

100.times counts the repeat cycles and, each time, places the number of the current cycle in the variable i written between | and |.
The count starts from 0.

This feature is very convenient when you want to repeat something a certain number of times.
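
Because the count starts from 0, a common adjustment is to add 1 when a count starting from 1 is wanted. A minimal sketch, where the Array name labels is an assumption for illustration:

```ruby
# i runs from 0 to 4; i + 1 gives a count that starts from 1.
labels = []
5.times {|i|
  labels << "Hello ! " + (i + 1).to_s
}
puts labels
```

This prints "Hello ! 1" through "Hello ! 5".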

Display elements of Array one by one

Use 'each' when you want to evaluate the elements of a collection one by one.

Use 'each'

array = [0, 1, 2, 3, 4]
array.each {|x|
puts x
}

RESULT of EXECUTION

bioruby> array = [0, 1, 2, 3, 4]
==> [0, 1, 2, 3, 4]
bioruby> array.each {|x|
bioruby+ puts x
}
0
1
2
3
4
==> [0, 1, 2, 3, 4]

As seen above, when 'each' is used, each element can be evaluated one by one.

Display the square of each element

array.each {|x|
puts x ** 2
}

RESULT of EXECUTION


bioruby> array.each {|x|
bioruby+ puts x ** 2
}
0
1
4
9
16
==> [0, 1, 2, 3, 4]


Note that 'each' returns the original array, not the squared values.
To get the results back as an array in place of the original elements, use 'map'.

Use 'map'

array.map {|x|
x ** 2
}

RESULT of EXECUTION

bioruby> array.map {|x|
bioruby+ x ** 2
}
==> [0, 1, 4, 9, 16]


Use 'map' instead of 'each'. 'map' means mapping.

Store the result of 'map' in array2

array
array2 = array.map {|x|
x ** 2
}
array2

RESULT of EXECUTION

bioruby> array
==> [0, 1, 2, 3, 4]
bioruby> array2 = array.map {|x|
bioruby+ x ** 2
}
==> [0, 1, 4, 9, 16]
bioruby> array2
==> [0, 1, 4, 9, 16]

As shown above, 'map' performs the same operation on each element of an Array in a very convenient way.
It expresses the transformation of a whole series of data as a single operation.
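
The difference between the two can be seen side by side by storing what each call returns. This is a sketch; the names result_each and result_map are assumptions for illustration.

```ruby
array = [0, 1, 2, 3, 4]

# 'each' runs the block but returns the original Array.
result_each = array.each {|x| x ** 2 }

# 'map' returns a new Array built from the values of the block.
result_map = array.map {|x| x ** 2 }

p result_each   # [0, 1, 2, 3, 4]
p result_map    # [0, 1, 4, 9, 16]
p array         # [0, 1, 2, 3, 4] (the original is unchanged)
```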

'each' with index

Use 'each_with_index' when you also want the index of each Array element at the same time.

array.each_with_index {|x, i|
puts [i.to_s, x.to_s].join(" ")
}

RESULT of EXECUTION

bioruby> array.each_with_index {|x, i|
bioruby+ puts [i.to_s, x.to_s].join(" ")
}
0 0
1 1
2 2
3 3
4 4
==> [0, 1, 2, 3, 4]
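
In the example above the index and the element happen to be the same number. A sketch with nucleotides makes the two easier to tell apart; the names bases and lines are made up for this example.

```ruby
bases = ["A", "C", "G", "T"]
lines = []
bases.each_with_index {|base, i|
  # i is the index counted from 0; i + 1 is the position counted from 1.
  lines << [(i + 1).to_s, base].join(" ")
}
puts lines
```

This prints "1 A", "2 C", "3 G" and "4 T", one per line.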


Evaluate Elements of Hash One by One

Use 'each'

hash = {"DNA" => "ACGT", "RNA" => "ACGU"}
hash.each {|key, value|
puts [key, value].join(": ")
}


RESULT of EXECUTION

bioruby> hash = {"DNA" => "ACGT", "RNA" => "ACGU"}
==> {"DNA"=>"ACGT", "RNA"=>"ACGU"}
bioruby> hash.each {|key, value|
bioruby+ puts [key, value].join(": ")
}
DNA: ACGT
RNA: ACGU
==> {"DNA"=>"ACGT", "RNA"=>"ACGU"}


Summary of Repeats


1. 'each' plays the central role in repetition.
2. 'each' can be used with both Array and Hash.
3. An operation on all elements of a collection (mapping) is done with 'map'.
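
Points 2 and 3 can be combined. As a sketch (the lecture only shows 'map' on an Array), 'map' can also be called on a Hash, where it returns an Array of the block's values. The name lengths is an assumption for illustration.

```ruby
hash = {"DNA" => "ACGT", "RNA" => "ACGU"}

# The block receives each key and value; 'map' collects what the block returns.
lengths = hash.map {|key, value|
  [key, value.length.to_s].join(": ")
}
p lengths   # ["DNA: 4", "RNA: 4"]
```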

Code

Code means a unit of a program. A program is composed of codes.
Everything we have copied and pasted during this practice is code.

Basically, code returns a value. What has been displayed so far after ==>, following the result of execution at the bioruby> prompt,
is the value of the code.

Code "ACGT"

"ACGT"

RESULT of EXECUTION

bioruby> "ACGT"
==> "ACGT"


Value of Code "ACGT" is "ACGT".

Code dna = "ACGT"

dna = "ACGT"

RESULT of EXECUTION

bioruby> dna = "ACGT"
==> "ACGT"

Code 'puts dna'

puts dna

RESULT of EXECUTION

bioruby> puts dna
ACGT
==> nil


The value of 'puts dna' is 'nil'. 'nil' is a special value that indicates 'empty'.
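
Because every piece of code has a value, that value can itself be stored in a variable. A minimal sketch, where the names value_of_assignment and value_of_puts are made up for this example:

```ruby
# An assignment is code too, so its value can be stored.
value_of_assignment = (dna = "ACGT")

# 'puts' prints its argument, but the value of the code is nil.
value_of_puts = puts(dna)

p value_of_assignment   # "ACGT"
p value_of_puts         # nil
p value_of_puts.nil?    # true
```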

Summary of Code


1. What we have copied and pasted so far is 'Code'.
2. 'nil' is the value that means 'empty'.

Condition

A condition is a structure that controls the flow of a program.
Here I introduce Boolean (true/false) judgement and the control mechanism of branching.

Judgement of true or false

Conditional branching relies on a true or false judgement. True and false are themselves possible values of code.
Ruby uses 'true' and 'false' as these values.
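
As a sketch of where such values come from: a comparison is code whose value is true or false, and that value can be stored in a variable like any other. The variable names below are assumptions for illustration.

```ruby
# The result of a comparison is an ordinary value that can be stored.
is_smaller   = 1 < 2
is_same      = "ACGT" == "ACGU"
is_different = "ACGT" != "ACGU"

p is_smaller     # true
p is_same        # false
p is_different   # true
```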

Less-than comparison


1 < 2

RESULT of EXECUTION

bioruby> 1 < 2
==> true

Greater-than comparison


1 > 2

RESULT of EXECUTION

bioruby> 1 > 2
==> false

Less-than-or-equal comparison


1 <= 1

RESULT of EXECUTION

bioruby> 1 <= 1
==> true

Greater-than-or-equal comparison


1 >= 2

RESULT of EXECUTION

bioruby> 1 >= 2
==> false

Comparison: equal


1 == 1

RESULT of EXECUTION

bioruby> 1 == 1
==> true

Comparison: not equal


1 != 2

RESULT of EXECUTION

bioruby> 1 != 2
==> true


If

'if' is the basis of conditional branching. When you test the same value against many branching conditions,
use 'case'.

Use 'if' for conditional branching. In the next example, the program evaluates the characters of a nucleotide sequence string one by one and, when it finds "A", prints "Adenine.".

dna = "ACGT"
dna.split("").each {|nucleotide|
if nucleotide == "A" then
puts "Adenine."
end
}

RESULT of EXECUTION

bioruby> dna.split("").each {|nucleotide|
bioruby+ if nucleotide == "A" then
puts "Adenine."
end
}
Adenine.
==> ["A", "C", "G", "T"]


This program evaluates the condition (if nucleotide == "A"). If the nucleotide is "A", it prints "Adenine."; otherwise it does nothing.

Let's take a look at the condition when the nucleotide is "A".

nucleotide = "A"
nucleotide == "A"

RESULT of EXECUTION

bioruby> nucleotide = "A"
==> "A"
bioruby> nucleotide == "A"
==> true


The '==' comparison returns the Boolean value true.

Let's see what happens when it is "T".

nucleotide = "T"
nucleotide == "A"

RESULT of EXECUTION

bioruby> nucleotide = "T"
==> "T"
bioruby> nucleotide == "A"
==> false


The '==' comparison returns the Boolean value false.

As you saw here, a true or false value is evaluated in the conditional branching of 'if'.
If the condition is true, 'if' executes the code after 'then' (in this case, 'puts "Adenine."').
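
The code after 'then' can be anything, not only 'puts'. As a sketch, the following counts the occurrences of "A" instead of printing them; the sequence and the variable name count are made up for this example.

```ruby
dna = "AACGTA"
count = 0
dna.split("").each {|nucleotide|
  # The code after 'then' runs only when the condition is true.
  if nucleotide == "A" then
    count = count + 1
  end
}
puts count   # prints 3
```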

If-else

Sometimes we want something to happen when all conditions are false.
For this we use 'if-then-else'. When all conditions are false, the code following 'else' is executed.

Here, if the nucleotide is "A", 'puts "Adenine."' is executed; otherwise 'puts "Others."' is executed.

dna = "ACGT"
dna.split("").each {|nucleotide|
if nucleotide == "A" then
puts "Adenine."
else
puts "Others."
end
}

RESULT of EXECUTION

bioruby> dna = "ACGT"
==> "ACGT"
bioruby> dna.split("").each {|nucleotide|
bioruby+ if nucleotide == "A" then
puts "Adenine."
else
bioruby+ puts "Others."
end
}
Adenine.
Others.
Others.
Others.
==> ["A", "C", "G", "T"]


case-when

This is easy when a variable can take only a few values. When it can take many, if-then-else becomes cluttered.

In such a case, use 'case-when' to sort out the many conditions.


dna = "ACGT"
dna.split("").each { |nucleotide|
case nucleotide
when "A","G"
puts "Purine."
when "C","T","U"
puts "Pyrimidine."
end
}

RESULT of EXECUTION


bioruby> dna = "ACGT"
==> "ACGT"
bioruby> dna.split("").each { |nucleotide|
bioruby+ case nucleotide
when "A","G"
puts "Purine."
when "C","T","U"
puts "Pyrimidine."
end
}
Purine.
Pyrimidine.
Purine.
Pyrimidine.
==> ["A", "C", "G", "T"]


As with 'if', you can use 'else' in 'case-when'.

What follows 'when' does not have to be a plain value.
You can also put a regular expression there, which will be described later.
I will come back to this point later.
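
A minimal sketch of 'else' inside 'case-when', assuming an input that contains a character ("N") not covered by any 'when'. The Array name labels and the text "Unknown." are made up for this example.

```ruby
labels = []
"ACGTN".split("").each {|nucleotide|
  case nucleotide
  when "A", "G"
    labels << "Purine."
  when "C", "T", "U"
    labels << "Pyrimidine."
  else
    # Reached when no 'when' matches, as for "N" here.
    labels << "Unknown."
  end
}
puts labels
```

The last line printed is "Unknown.", because "N" matches neither 'when'.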

Summary of Conditional Branching


1. Boolean values are the special values 'true' and 'false'.
2. A comparison returns a Boolean value.
3. 'case-when' may be easier to use when the branching gets a little complex.

Regular Expression

Regular expressions are very important for handling strings.
A regular expression describes a set of strings.
For example, all lowercase letters of the alphabet can be expressed by [a-z].

Things You Can Do If You Know Regular Expression

1. Search motifs in sequences
2. Extract necessary information from strings

Match one or more characters of a DNA sequence (upper and lower case)

/[ACGTacgt]+/

Example of match

"acgt" =~ /[ACGTacgt]+/

RESULT of EXECUTION

bioruby> "acgt" =~ /[ACGTacgt]+/
==> 0

If the string matches, the position of the match is returned (here 0, the first character).

[ACGTacgt] stands for any one of the characters inside [], and + means one or more. Therefore, /[ACGTacgt]+/ is a regular expression that matches a string of one or more of the characters inside []. A regular expression is written between / and /.
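
The value returned by =~ is the position where the match starts, counted from 0. A small sketch, in which the string "xxacgt" and the variable names are made up for illustration:

```ruby
# The match starts at the first character, so the position is 0.
position_at_start = "acgt" =~ /[ACGTacgt]+/

# Two non-matching characters come first, so the position is 2.
position_later = "xxacgt" =~ /[ACGTacgt]+/

p position_at_start   # 0
p position_later      # 2
```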


Example Not Matching


"bbbb" =~ /[ACGTacgt]+/

RESULT of EXECUTION

bioruby> "bbbb" =~ /[ACGTacgt]+/
==> nil


If there is no match, 'nil' is returned.

Characters

1. A word character (letter, digit, or underscore) is '\w'. A non-word character is '\W'.
2. A digit is '\d'. A non-digit is '\D'.
3. A whitespace character (space, tab, newline, carriage return) is '\s'. Everything else is '\S'.

Any single character (except a newline, by default) is given by '.'.

Length

1. One or more is '+'.
2. Zero or more is '*'.
3. An exact length is '{length}'.
4. A range of lengths is '{minimum length,maximum length}'.

Further, symbols for position

1. Start (of a line): ^
2. End (of a line): $
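
A few of these symbols in action. This is a sketch with made-up example strings and variable names; it relies on Ruby writing an exact length as {n} and a range of lengths as {min,max}.

```ruby
# \d with an exact length: exactly four digits between start (^) and end ($).
year_ok  = "2007" =~ /^\d{4}$/
year_bad = "20071010" =~ /^\d{4}$/

# A range of lengths: between 3 and 6 DNA characters.
dna_ok  = "ACGT" =~ /^[ACGT]{3,6}$/
dna_bad = "AC" =~ /^[ACGT]{3,6}$/

# \w+ matches a word, \s+ the whitespace between two words.
words_ok = "Hello Ruby" =~ /^\w+\s+\w+$/

p year_ok    # 0
p year_bad   # nil
p dna_ok     # 0
p dna_bad    # nil
p words_ok   # 0
```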

Regular Expression for STOP CODONS by IUPAC codes

stop_codon = /U(A[AG]|GA)/
"UAA" =~ stop_codon
"UAG" =~ stop_codon
"UGA" =~ stop_codon
"AUG" =~ stop_codon

RESULT of EXECUTION


bioruby> stop_codon = /U(A[AG]|GA)/
==> /U(A[AG]|GA)/
bioruby> "UAA" =~ stop_codon
==> 0
bioruby> "UAG" =~ stop_codon
==> 0
bioruby> "UGA" =~ stop_codon
==> 0
bioruby> "AUG" =~ stop_codon
==> nil


Now let me explain /U(A[AG]|GA)/.

1. There are three STOP CODONS: "UAA", "UAG", "UGA".
2. A regular expression that covers all three STOP CODONS is /(UAA|UAG|UGA)/.
Here '|' means 'or', and ( ) makes a group (grouping of strings).
3. Next, all STOP CODONS start with 'U', so you can write this as /U(AA|AG|GA)/.
4. Going one step further, 'AA or AG' can be written as A[AG], which gives /U(A[AG]|GA)/.

As these examples show, the same strings can be matched by many different regular expressions.
Writing the shortest possible regular expression (regex) may take a master's skill.
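
One way to check that the three forms really match the same strings is to try them on all 64 codons. This is a sketch; the variable names are made up, and 'select' (which keeps the elements for which the block is true) is a core Ruby method not covered in this lecture.

```ruby
long_form   = /(UAA|UAG|UGA)/
middle_form = /U(AA|AG|GA)/
short_form  = /U(A[AG]|GA)/

# Build all 64 codons from the four RNA bases.
bases = ["A", "C", "G", "U"]
codons = []
bases.each {|x|
  bases.each {|y|
    bases.each {|z|
      codons << x + y + z
    }
  }
}

# Keep the codons that each expression matches.
long_hits   = codons.select {|codon| codon =~ long_form }
middle_hits = codons.select {|codon| codon =~ middle_form }
short_hits  = codons.select {|codon| codon =~ short_form }

p long_hits                  # ["UAA", "UAG", "UGA"]
p long_hits == middle_hits   # true
p long_hits == short_hits    # true
```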

Regular expressions are used to find strings that match patterns. Matches can be used for:

1. substitution in strings (gsub, sub)
2. collecting the matching strings (scan)
3. true or false conditions (if, case-when, =~)
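
A minimal sketch of the first two uses, with a made-up DNA string and variable names: 'gsub' replaces every match, 'sub' replaces only the first one, and 'scan' collects the matches.

```ruby
dna = "ACGTTGCA"

# gsub replaces every match; sub replaces only the first.
rna_all   = dna.gsub(/T/, "U")
rna_first = dna.sub(/T/, "U")

# scan returns every match as an Array.
ts = dna.scan(/T/)

p rna_all     # "ACGUUGCA"
p rna_first   # "ACGUTGCA"
p ts          # ["T", "T"]
```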


For example, you can write a script that detects whether a sequence is DNA or RNA as follows:

case sequence
when /^[ACGT]+$/
puts "DNA"
when /^[ACGU]+$/
puts "RNA"
end

If it is a DNA sequence

sequence = 'ACTG' * 100
case sequence
when /^[ACGT]+$/
puts "DNA"
when /^[ACGU]+$/
puts "RNA"
end

RESULT of EXECUTION


bioruby> sequence = 'ACTG' * 100
==> "ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTG"
bioruby> case sequence
when /^[ACGT]+$/
puts "DNA"
when /^[ACGU]+$/
puts "RNA"
end
DNA
==> nil


Here the sequence is DNA.

If the sequence is RNA

sequence = 'AUCG' * 100
case sequence
when /^[ACGT]+$/
puts "DNA"
when /^[ACGU]+$/
puts "RNA"
end

RESULT of EXECUTION


bioruby> sequence = 'AUCG' * 100
==> "AUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCG"
bioruby> case sequence
when /^[ACGT]+$/
puts "DNA"
when /^[ACGU]+$/
puts "RNA"
end
RNA
==> nil


Now it is an RNA sequence.
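
Two details of this script are worth trying out. The 'when' branches are tested in order, so a sequence with neither T nor U is reported as DNA, and a sequence matching neither pattern prints nothing unless an 'else' is added. A sketch with made-up test sequences, where the label "Unknown" and the Array name kinds are assumptions for illustration:

```ruby
kinds = []
["ACGT", "ACGU", "ACG", "ACGTU", "MKV"].each {|sequence|
  case sequence
  when /^[ACGT]+$/
    kinds << "DNA"
  when /^[ACGU]+$/
    kinds << "RNA"
  else
    # Neither pattern matched, e.g. T and U mixed, or protein letters.
    kinds << "Unknown"
  end
}
p kinds   # ["DNA", "RNA", "DNA", "Unknown", "Unknown"]
```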

Capture Matched Strings

Strings matched by the groups in ( ) are stored in the match variables $1 through $9.

if position = "acgtttttgagtggtaccgttttat" =~ /([acgt]gt)/
p position
p $1
end

RESULT of EXECUTION

bioruby> if position = "acgtttttgagtggtaccgttttat" =~ /([acgt]gt)/
p position
p $1
end
1
"cgt"
==> nil


A string matching /([acgt]gt)/, namely 'cgt', is found starting at the second character of the string (position 1).
However, there are more matches in this string.

To obtain all matching strings, use 'scan'.

matches = "acgtttttgagtggtaccgttttat".scan(/([acgt]gt)/)
p matches

RESULT of EXECUTION

bioruby> matches = "acgtttttgagtggtaccgttttat".scan(/([acgt]gt)/)
==> [["cgt"], ["agt"], ["ggt"], ["cgt"]]
bioruby> p matches
[["cgt"], ["agt"], ["ggt"], ["cgt"]]


As you can imagine, 'scan' can be used to capture all STOP CODONS in a sequence.
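
A sketch of that idea, using a made-up RNA string. Note two assumptions: this finds STOP CODONS at any position, without regard to reading frame, and when the pattern contains ( ), 'scan' returns only the captured part of each match.

```ruby
rna = "AUGGCCUAAGGCUGAUAG"

# Without ( ), scan returns the matched strings themselves.
stops = rna.scan(/UAA|UAG|UGA/)

# With ( ), scan returns only what each group captured.
captured = rna.scan(/U(A[AG]|GA)/)

p stops      # ["UAA", "UGA", "UAG"]
p captured   # [["AA"], ["GA"], ["AG"]]
```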

Summary of Regular Expression

1. A regular expression is a way to describe a set of strings.
2. Characters, lengths, and positions are expressed by different symbols.
3. You can capture the positions and strings that matched.
