Monday, November 19, 2007

Nakao-Sensei's LectureNote 2007-10-10

Programming Basics III
Translation of "http://bioruby.g.hatena.ne.jp/nakao_mitsuteru/20071010/II"

By Nakao Mitsuteru (Lecture Note given on 2007-10-10)

Programming Basics II: Repeats, Code, Condition, Regular Expression

In this chapter, I will introduce repeats, code, conditions, and regular expressions.
These are widely used not only in bioinformatics but in programming in general. Regular expressions in particular are a must for treating a DNA sequence as a string.
In addition, I will introduce many general programming concepts here.

Repeats

Repeats are very easy to write in the Ruby programming language.

Repeat 100 times

REPEAT UNIT

puts "Hello !"

RESULT of EXECUTION

bioruby> puts "Hello !"
Hello !
==> nil

Do the following to repeat this 100 times:

bioruby> 100.times { puts "Hello !" }

RESULT of EXECUTION

bioruby> 100.times { puts "Hello !" }
Hello !
Hello !
Hello !
Hello !
Hello !
.........
Hello !
Hello !
Hello !
Hello !
Hello !
==> 100

A segment of a program surrounded by { and } is called a BLOCK.
100.times takes the BLOCK that follows it and runs it 100 times.
BLOCKs are used very often in Ruby.

This is an interesting feature of Ruby.
If you are interested in this feature, please read the section on BLOCKs in a Ruby book.
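
As a minimal sketch of the same idea, a BLOCK can also be written with do ... end instead of { and }. The method names below (count_with_braces, count_with_do_end) are made up for this illustration and are not part of Ruby or BioRuby.

```ruby
# Hypothetical helper names, chosen only for this illustration.

# Runs a block written with { } three times and counts the runs.
def count_with_braces
  count = 0
  3.times { count += 1 }
  count
end

# The same block written with do ... end.
def count_with_do_end
  count = 0
  3.times do
    count += 1
  end
  count
end

puts count_with_braces  # prints 3
puts count_with_do_end  # prints 3
```

Both forms do the same thing here. Braces are common for one-line blocks and do ... end for longer ones.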

Repeat 100 times with numbers

repeat 100 times with numbers

100.times {|i|
puts "Hello ! " + i.to_s
}

RESULT of EXECUTION

bioruby> 100.times {|i|
bioruby+ puts "Hello ! " + i.to_s
}
Hello ! 0
Hello ! 1
Hello ! 2
Hello ! 3
Hello ! 4
..............
Hello ! 95
Hello ! 96
Hello ! 97
Hello ! 98
Hello ! 99
==> 100

100.times passes the number of the current cycle to the BLOCK.
On every pass, the number of the cycle is placed in the variable i between | and |.
The count starts from 0.

This feature is very convenient when you want to repeat something a fixed number of times.

Display elements of Array one by one

Use 'each' when you want to evaluate the elements of a collection one by one.

Use 'each'

array = [0, 1, 2, 3, 4]
array.each {|x|
puts x
}

RESULT of EXECUTION

bioruby> array = [0, 1, 2, 3, 4]
==> [0, 1, 2, 3, 4]
bioruby> array.each {|x|
bioruby+ puts x
}
0
1
2
3
4
==> [0, 1, 2, 3, 4]

As seen above, when 'each' is used, each element can be evaluated one by one.

Display the square of each element

array.each {|x|
puts x ** 2
}

RESULT of EXECUTION


bioruby> array.each {|x|
bioruby+ puts x ** 2
}
0
1
4
9
16
==> [0, 1, 2, 3, 4]


You can also get the results back as an Array, with each original element replaced by its computed value.
To do this, use 'map'.

Use 'map'

array.map {|x|
x ** 2
}

RESULT of EXECUTION

bioruby> array.map {|x|
bioruby+ x ** 2
}
==> [0, 1, 4, 9, 16]


Use 'map' instead of 'each'. 'map' means mapping.

Store the result of 'map' in array2

array
array2 = array.map {|x|
x ** 2
}
array2

RESULT of EXECUTION

bioruby> array
==> [0, 1, 2, 3, 4]
bioruby> array2 = array.map {|x|
bioruby+ x ** 2
}
==> [0, 1, 4, 9, 16]
bioruby> array2
==> [0, 1, 4, 9, 16]

As shown above, 'map' performs the same operation on each element of an Array in a very convenient way.
It makes it easy to transform one series of data into another.
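
The difference between 'each' and 'map' can be sketched as follows. With 'each' you would have to collect the results yourself, while 'map' returns them directly. The method names squares_with_each and squares_with_map are hypothetical and exist only for this illustration.

```ruby
# Hypothetical helper names, chosen only for this illustration.

# With 'each' the squared values must be collected by hand,
# because 'each' itself returns the original array.
def squares_with_each(array)
  result = []
  array.each { |x| result << x ** 2 }
  result
end

# 'map' builds the new array from the value of the block.
def squares_with_map(array)
  array.map { |x| x ** 2 }
end

p squares_with_each([0, 1, 2, 3, 4])  # prints [0, 1, 4, 9, 16]
p squares_with_map([0, 1, 2, 3, 4])   # prints [0, 1, 4, 9, 16]
```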

'each' with index

Use 'each_with_index' when you also want to use the index of each Array element at the same time.

array.each_with_index {|x, i|
puts [i.to_s, x.to_s].join(" ")
}

RESULT of EXECUTION

bioruby> array.each_with_index {|x, i|
bioruby+ puts [i.to_s, x.to_s].join(" ")
}
0 0
1 1
2 2
3 3
4 4
==> [0, 1, 2, 3, 4]
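
As a small sketch closer to sequence work, 'each_with_index' can label each base of a DNA string with its position. Here I assume positions should be counted from 1, so 1 is added to the index. The method name numbered_bases is hypothetical.

```ruby
# numbered_bases is a hypothetical helper name for this illustration.
# It pairs each base with its position, counting from 1.
def numbered_bases(dna)
  lines = []
  dna.split("").each_with_index { |base, i|
    lines << [(i + 1).to_s, base].join(" ")
  }
  lines
end

puts numbered_bases("ACGT")
# prints:
# 1 A
# 2 C
# 3 G
# 4 T
```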


Evaluate Elements of Hash One by One

Use 'each'

hash = {"DNA" => "ACGT", "RNA" => "ACGU"}
hash.each {|key, value|
puts [key, value].join(": ")
}


RESULT of EXECUTION

bioruby> hash = {"DNA" => "ACGT", "RNA" => "ACGU"}
==> {"DNA"=>"ACGT", "RNA"=>"ACGU"}
bioruby> hash.each {|key, value|
bioruby+ puts [key, value].join(": ")
}
DNA: ACGT
RNA: ACGU
==> {"DNA"=>"ACGT", "RNA"=>"ACGU"}
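
The same pattern can also build a new Hash while walking over an existing one. This is only a sketch, and the method name alphabet_sizes is hypothetical.

```ruby
# alphabet_sizes is a hypothetical helper name for this illustration.
# It walks over a Hash with 'each' and records the length of every value.
def alphabet_sizes(hash)
  sizes = {}
  hash.each { |key, value|
    sizes[key] = value.length
  }
  sizes
end

p alphabet_sizes({"DNA" => "ACGT", "RNA" => "ACGU"})
```

Both alphabets have four letters, so each key ends up with the value 4.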


Summary of Repeats


1. For repeats, 'each' plays a central role.
2. 'each' can be used for both Array and Hash.
3. An operation on all the elements of a collection (mapping) is done by 'map'.

Code

A code is a unit of a program (Ruby books usually call it an expression). A program is composed of codes.
Everything we have copied and pasted during this practice is code.

Basically, a code returns a value. So far, whatever was displayed after ==> in the results of execution at the bioruby> prompt
was the value of the code.

Code "ACGT"

"ACGT"

RESULT of EXECUTION

bioruby> "ACGT"
==> "ACGT"


Value of Code "ACGT" is "ACGT".

Code dna = "ACGT"

dna = "ACGT"

RESULT of EXECUTION

bioruby> dna = "ACGT"
==> "ACGT"

Code 'puts dna'

puts dna

RESULT of EXECUTION

bioruby> puts dna
ACGT
==> nil


The value of 'puts dna' is 'nil'. 'nil' is a special value that indicates 'empty'.
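
As a small sketch, the value of 'puts' can be stored in a variable and tested with nil?. The method name value_of_puts is hypothetical and exists only for this illustration.

```ruby
# value_of_puts is a hypothetical helper name for this illustration.
# A method returns the value of its last code, here the value of 'puts'.
def value_of_puts(text)
  puts text
end

value = value_of_puts("ACGT")  # prints ACGT
p value                        # prints nil
p value.nil?                   # prints true
```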

Summary of Code


1. What we have copied and pasted so far is 'code'.
2. 'nil' is the value for 'empty'.

Condition

A condition is a structure for controlling the flow of a program.
Here I introduce Boolean (true/false) tests and the mechanism of branching.

True or false tests

Conditional branching uses a true or false test. The value of some code is one of these: true or false.
Ruby uses 'true' and 'false' as these values.

Greater-than and less-than comparison


1 < 2

RESULT of EXECUTION

bioruby> 1 < 2
==> true

Greater-than and less-than comparison again


1 > 2

RESULT of EXECUTION

bioruby> 1 > 2
==> false

Greater-than-or-equal and less-than-or-equal comparison


1 <= 1

RESULT of EXECUTION

bioruby> 1 <= 1
==> true

Greater-than-or-equal and less-than-or-equal comparison again


1 >= 2

RESULT of EXECUTION

bioruby> 1 >= 2
==> false

Equal and not-equal comparison


1 == 1

RESULT of EXECUTION

bioruby> 1 == 1
==> true

Equal and not-equal comparison again


1 != 2

RESULT of EXECUTION

bioruby> 1 != 2
==> true
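
The same comparisons work on values taken from sequences, such as their lengths. This is only a sketch, and the method names longer? and same_sequence? are hypothetical.

```ruby
# longer? and same_sequence? are hypothetical helper names for this illustration.

# true when sequence a has more characters than sequence b.
def longer?(a, b)
  a.length > b.length
end

# true when both sequences are exactly the same string.
def same_sequence?(a, b)
  a == b
end

p longer?("ACGT", "AC")           # prints true
p longer?("AC", "ACGT")           # prints false
p same_sequence?("ACGT", "ACGT")  # prints true
p same_sequence?("ACGT", "ACGU")  # prints false
```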


If

'if' is the basis of conditional branching. When you test the same value against many branching conditions,
use 'case'.

Use 'if' for conditional branching. In the next example, the program evaluates the characters of a nucleotide sequence string one by one and, when it finds "A", prints "Adenine.".

dna = "ACGT"
dna.split("").each {|nucleotide|
if nucleotide == "A" then
puts "Adenine."
end
}

RESULT of EXECUTION

bioruby> dna.split("").each {|nucleotide|
bioruby+ if nucleotide == "A" then
puts "Adenine."
end
}
Adenine.
==> ["A", "C", "G", "T"]


Here the program evaluates the condition (nucleotide == "A"). If the nucleotide is "A", it prints "Adenine."; otherwise it does nothing.

Let's take a look at the condition when the nucleotide is "A".

nucleotide = "A"
nucleotide == "A"

RESULT of EXECUTION

bioruby> nucleotide = "A"
==> "A"
bioruby> nucleotide == "A"
==> true


'==' returns the Boolean value true, because the two strings are equal.

Let's see what happens when it is "T".

nucleotide = "T"
nucleotide == "A"

RESULT of EXECUTION

bioruby> nucleotide = "T"
==> "T"
bioruby> nucleotide == "A"
==> false


'==' returns the Boolean value false, because the two strings differ.

As you saw here, 'if' evaluates whether its condition is true or false.
If the condition is true, the code after 'then' is executed (in this case, puts "Adenine.").

If-else

Sometimes we want to do something when all conditions are false.
For this we use 'if-then-else'. The code following 'else' is executed when all conditions are false.

Here, if the nucleotide is "A", puts "Adenine." is executed; otherwise puts "Others." is executed.

dna = "ACGT"
dna.split("").each {|nucleotide|
if nucleotide == "A" then
puts "Adenine."
else
puts "Others."
end
}

RESULT of EXECUTION

bioruby> dna = "ACGT"
==> "ACGT"
bioruby> dna.split("").each {|nucleotide|
bioruby+ if nucleotide == "A" then
puts "Adenine."
else
bioruby+ puts "Others."
end
}
Adenine.
Others.
Others.
Others.
==> ["A", "C", "G", "T"]
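
When there is more than one condition, 'elsif' can be placed between 'if' and 'else'. The sketch below tests the conditions from top to bottom and falls through to 'else' only when all of them are false. The method name base_name is hypothetical.

```ruby
# base_name is a hypothetical helper name for this illustration.
# The conditions are tested in order; 'else' is used when all are false.
def base_name(nucleotide)
  if nucleotide == "A" then
    "Adenine"
  elsif nucleotide == "C" then
    "Cytosine"
  elsif nucleotide == "G" then
    "Guanine"
  elsif nucleotide == "T" then
    "Thymine"
  else
    "Others"
  end
end

"ACGTN".split("").each { |nucleotide|
  puts base_name(nucleotide)
}
# prints:
# Adenine
# Cytosine
# Guanine
# Thymine
# Others
```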


case-when

This is easy when a variable can take only a few values. When it can take many, however, if-then-else becomes cluttered.

In such a situation, use 'case-when' to sort out the many conditions.


dna = "ACGT"
dna.split("").each { |nucleotide|
case nucleotide
when "A","G"
puts "Purine."
when "C","T","U"
puts "Pyrimidine."
end
}

RESULT of EXECUTION


bioruby> dna = "ACGT"
==> "ACGT"
bioruby> dna.split("").each { |nucleotide|
bioruby+ case nucleotide
when "A","G"
puts "Purine."
when "C","T","U"
puts "Pyrimidine."
end
}
Purine.
Pyrimidine.
Purine.
Pyrimidine.
==> ["A", "C", "G", "T"]


You can use 'else' with 'case-when', just as with 'if'.

What follows 'when' does not have to be a plain value.
You can also use a regular expression, which will be described later.
I will come back to this point later.
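
A minimal sketch of 'case-when' with 'else', assuming we want to report anything that is not a known base as "Unknown". The method name base_class is hypothetical.

```ruby
# base_class is a hypothetical helper name for this illustration.
# 'else' is used when none of the 'when' values match.
def base_class(nucleotide)
  case nucleotide
  when "A", "G"
    "Purine"
  when "C", "T", "U"
    "Pyrimidine"
  else
    "Unknown"
  end
end

p base_class("G")  # prints "Purine"
p base_class("U")  # prints "Pyrimidine"
p base_class("N")  # prints "Unknown"
```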

Summary of Conditional Branching


1. A Boolean value is one of the special values 'true' and 'false'.
2. A comparison returns a Boolean value.
3. It may be easier to use 'case-when' when the branching gets a little complex.

Regular Expression

Regular expressions are very important for handling strings.
A regular expression describes a set of strings.
For example, all lowercase letters of the alphabet can be expressed by [a-z].

Things You Can Do If You Know Regular Expression

1. Search for motifs in sequences
2. Extract the necessary information from strings

Search for one or more characters of a DNA sequence (upper and lower case)

/[ACGTacgt]+/

Example of match

"acgt" =~ /[ACGTacgt]+/

RESULT of EXECUTION

bioruby> "acgt" =~ /[ACGTacgt]+/
==> 0

If the string matches, =~ returns the position where the match starts (here 0, the first character).

[ACGTacgt] stands for any one of the characters inside [ ], and + means one or more. Therefore, /[ACGTacgt]+/ is a regular expression that matches a string of one or more of the characters inside [ ]. A regular expression is written between / and /.
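
The value returned by =~ is a position counted from 0, so it is not always 0. The sketch below shows a string where the DNA letters start later. The method name first_dna_position is hypothetical.

```ruby
# first_dna_position is a hypothetical helper name for this illustration.
# Returns the position where the first run of DNA letters starts.
def first_dna_position(text)
  text =~ /[ACGTacgt]+/
end

p first_dna_position("acgt")    # prints 0
p first_dna_position("xxacgt")  # prints 2
```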

...................back later

Example Not Matching


"bbbb" =~ /[ACGTacgt]+/

RESULT of EXECUTION

bioruby> "bbbb" =~ /[ACGTacgt]+/
==> nil


If there is NO match, 'nil' is returned.

Characters

1. A word character (letter, digit or underscore) is written '\w'. A non-word character is '\W'.
2. A digit is written '\d'. A non-digit is '\D'.
3. A whitespace character (space, tab, newline, return) is '\s'. Any other character is '\S'.

Any single character is written '.'.

Length

1. One or more is '+'.
2. Zero or more is '*'.
3. An exact length is '{length}', for example {3}.
4. A range of lengths is '{minimum length,maximum length}', for example {2,4}. (Please check these in your English Ruby book.)

Further, symbols for position

1. head (start of a line) ^
2. last (end of a line) $
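
These symbols can be combined. The following sketch uses the standard Ruby forms {3} for an exact length and {2,4} for a range of lengths, together with \d, \s, ^ and $. The method names codon?, first_number and wide_gap? are hypothetical.

```ruby
# codon?, first_number and wide_gap? are hypothetical helper names.

# true when the text is exactly three RNA letters: ^ and $ mark the
# start and end of the line, {3} asks for exactly three characters.
def codon?(text)
  (text =~ /^[ACGU]{3}$/) ? true : false
end

# Returns the first run of digits (\d+) in the text, or nil if there is none.
def first_number(text)
  text[/\d+/]
end

# true when the text contains at least two whitespace characters in a row
# ({2,4} matches a run of two to four).
def wide_gap?(text)
  (text =~ /\s{2,4}/) ? true : false
end

p codon?("AUG")          # prints true
p codon?("AUGA")         # prints false
p first_number("chr21")  # prints "21"
p wide_gap?("A  C")      # prints true
p wide_gap?("A C")       # prints false
```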

Regular Expression for STOP CODONS by IUPAC codes

stop_codon = /U(A[AG]|GA)/
"UAA" =~ stop_codon
"UAG" =~ stop_codon
"UGA" =~ stop_codon
"AUG" =~ stop_codon

RESULT of EXECUTION


bioruby> stop_codon = /U(A[AG]|GA)/
==> /U(A[AG]|GA)/
bioruby> "UAA" =~ stop_codon
==> 0
bioruby> "UAG" =~ stop_codon
==> 0
bioruby> "UGA" =~ stop_codon
==> 0
bioruby> "AUG" =~ stop_codon
==> nil


Now let me explain /U(A[AG]|GA)/.

1. There are three STOP CODONS: "UAA", "UAG", "UGA".
2. A regular expression that expresses all three STOP CODONS is /(UAA|UAG|UGA)/.
Here '|' means 'or', and ( ) makes a group (grouping of strings).
3. Next, all STOP CODONS start with 'U', therefore you can express them as /U(AA|AG|GA)/.
4. Grouping further gives /U(A[AG]|GA)/, where A[AG] means AA or AG.

As shown in these examples, the same strings can be matched by many different regular expressions.
Writing a shorter regular expression (regex) may take a master's skill.

Regular expressions are used to find strings that match a pattern. Matching can be used for

1. substitution of strings (gsub, sub)
2. extracting the matching strings (scan)
3. true or false conditions (if, case-when, =~)
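
A minimal sketch of substitution and of a match used as a condition. 'gsub' replaces every match and 'sub' replaces only the first one. The method names transcribe, mask_first_stop and has_stop_codon? are hypothetical, and replacing a stop codon with "***" is just an arbitrary choice for the example.

```ruby
# transcribe, mask_first_stop and has_stop_codon? are hypothetical helper names.

# gsub replaces every match: each T becomes U.
def transcribe(dna)
  dna.gsub(/T/, "U")
end

# sub replaces only the first match: the first stop codon becomes "***".
def mask_first_stop(rna)
  rna.sub(/U(A[AG]|GA)/, "***")
end

# =~ used as a true or false condition in 'if'.
def has_stop_codon?(rna)
  if rna =~ /U(A[AG]|GA)/ then
    true
  else
    false
  end
end

p transcribe("ATGTAA")          # prints "AUGUAA"
p mask_first_stop("AUGUAAUGA")  # prints "AUG***UGA"
p has_stop_codon?("AUGUAA")     # prints true
p has_stop_codon?("AUGGCC")     # prints false
```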


For example, you can write a script that determines whether a sequence is DNA or RNA as follows:

case sequence
when /^[ACGT]+$/
puts "DNA"
when /^[ACGU]+$/
puts "RNA"
end

If it is a DNA sequence

sequence = 'ACTG' * 100
case sequence
when /^[ACGT]+$/
puts "DNA"
when /^[ACGU]+$/
puts "RNA"
end

RESULT of EXECUTION


bioruby> sequence = 'ACTG' * 100
==> "ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTG"
bioruby> case sequence
when /^[ACGT]+$/
puts "DNA"
when /^[ACGU]+$/
puts "RNA"
end
DNA
==> nil


Here the sequence is DNA.

If the sequence is RNA

sequence = 'AUCG' * 100
case sequence
when /^[ACGT]+$/
puts "DNA"
when /^[ACGU]+$/
puts "RNA"
end

RESULT of EXECUTION


bioruby> sequence = 'AUCG' * 100
==> "AUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCGAUCG"
bioruby> case sequence
when /^[ACGT]+$/
puts "DNA"
when /^[ACGU]+$/
puts "RNA"
end
RNA
==> nil


Now it is an RNA sequence.

Capture Matched Strings

The strings matched by the ( ) groups are stored in the match variables $1 through $9.

if position = "acgtttttgagtggtaccgttttat" =~ /([acgt]gt)/
p position
p $1
end

RESULT of EXECUTION

bioruby> if position = "acgtttttgagtggtaccgttttat" =~ /([acgt]gt)/
p position
p $1
end
1
"cgt"
==> nil


A string matching /([acgt]gt)/ is found starting at the second character of the string (position 1): 'cgt'.
However, there are more matching substrings.

To obtain all the matching strings, use 'scan'.

matches = "acgtttttgagtggtaccgttttat".scan(/([acgt]gt)/)
p matches

RESULT of EXECUTION

bioruby> matches = "acgtttttgagtggtaccgttttat".scan(/([acgt]gt)/)
==> [["cgt"], ["agt"], ["ggt"], ["cgt"]]
bioruby> p matches
[["cgt"], ["agt"], ["ggt"], ["cgt"]]


You can imagine using 'scan' to capture all possible STOP CODONS.
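
A sketch of that idea is shown below. I use (?: ) here, which groups without capturing, so that 'scan' returns the whole matched codons rather than only the part inside ( ). The names STOP_CODON and stop_codons are hypothetical. Note that this simple scan ignores the reading frame and just reports every place where a stop codon pattern appears.

```ruby
# STOP_CODON and stop_codons are hypothetical names for this illustration.
# (?: ) groups without capturing, so scan returns the whole matched codons.
STOP_CODON = /U(?:A[AG]|GA)/

# Returns every stop codon pattern found in the RNA string, in order.
# The reading frame is ignored in this sketch.
def stop_codons(rna)
  rna.scan(STOP_CODON)
end

p stop_codons("AUGUAACCUGAUAG")  # prints ["UAA", "UGA", "UAG"]
p stop_codons("AUGGCC")          # prints []
```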

Summary of Regular Expression

1. A regular expression is a way to describe a set of strings.
2. Characters, lengths and positions are expressed by different symbols.
3. You can capture the positions and the strings that matched.

Saturday, November 17, 2007

First Try "Syntax Highlighter"

Here I want to try "Syntax Highlighter"


#!/usr/bin/env bioruby

seq = seq("gbphg.seq")
p seq
p seq.translate

Thursday, November 15, 2007

投稿 = SUBMIT in Menu

Hi, I realized that my menu bar is only in Japanese.
English speakers, please press the 投稿 button to submit your comments.
For Japanese readers, OK, it's obvious.

I also added a few links to Nakao Sensei's and Katayama Sensei's sites.
There are some English versions on Nakao Sensei's site.

If you have questions, let me know!

Hello こんにちは。

Jan Aerts and I met at the Cold Spring Harbor course 'Programming for Biology', which Lincoln Stein organized in the middle of this October. Yes, it was during that Watson turmoil. He talked passionately about Ruby and the BioRuby movement, and I learned a lot from him (he was one of the TAs). The course was like a two-week boot camp, the most exhausting course but also the best course ever for me. Since I am still Japanese and can speak, read and write Japanese as my native language (it actually is), and I can also read, speak and write English with almost no difficulty (it is somewhat harder for me than Japanese, I have to confess, and my friends tell me that I still have a heavy Japanese accent, but I can make some bad jokes in English), he asked me several questions about Japanese; he really wants to learn Japanese to communicate with BioRuby developers. After several emails back and forth between the US and Scotland, we decided to make an outreach blog site for BioRuby. Japanese people can write in Japanese and non-Japanese people can write in English, and hopefully I can translate both. My hope is that these two worlds can merge through Ruby and BioRuby.

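I got to know Jan Aerts at a course called 'Programming for Biology' held at Cold Spring Harbor, New York. It was organized by Dr. Lincoln Stein and others and ran for two weeks from mid-October, yes, just as that Watson affair was unfolding. He spoke passionately about Ruby and the BioRuby movement, and before I knew it I found myself drawn in by the appeal of Ruby. The course was like a two-week boot camp where we were drilled hard; it was the toughest course I have taken, but also the best so far. Jan learned that I am Japanese and can use Japanese, and that after living in America for many years I can tell terrible jokes in English (my friends tease me that I still speak English with a Japanese accent, but in fact I type faster in English), so he asked me all sorts of questions about Japanese. He said he wanted to learn Japanese, and that this was because he wanted to talk with the Japanese BioRuby developers. After the course ended we went our separate ways, to the US and Scotland, but we kept in touch by email. Then we came up with the idea of making a blog for BioRuby in English and Japanese. Japanese people post in Japanese, non-Japanese people post in English, and I translate.
My hope is that the two worlds become one through this blog.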