The most exciting phrase to hear in science, the one that heralds new discoveries, is not 'Eureka!' but 'That's funny...' --Isaac Asimov
That’s Funny…

Open Source Software Notes

I have to make a lot of notes to myself about how to do stuff on the computer.

Linux/Kubuntu

to copy a directory quickly over the network with ssh:

cd /destination/dir/ && ssh SOURCE "cd /source/dir && tar -czvf - *" | tar xzf -
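The tar pipe can be tried locally before pointing it at a real host. This is only a sketch: copy_tree is a made-up helper name, and the first subshell stands in for the remote end of the ssh connection. It archives "." rather than "*" so that dotfiles come along too.

```shell
# copy_tree SRC DEST: stream SRC through a tar pipe into DEST.
# Hypothetical helper; swap the first subshell for
#   ssh SOURCE "cd /source/dir && tar -czf - ."
# to get the networked version.
copy_tree() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    # "." (rather than "*") also picks up hidden files
    (cd "$src" && tar -czf - .) | (cd "$dest" && tar -xzf -)
}
```

Run it as: copy_tree /source/dir /destination/dir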

to find the day of year of a particular date:

date --date='27 Nov 2007' +%j
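Going the other way (day of year back to a calendar date) can be done with the same GNU date. A minimal sketch, assuming GNU coreutils; doy_to_date is a made-up function name, and it expects a day of year of 1 or more.

```shell
# doy_to_date YEAR DOY: print the calendar date (YYYY-MM-DD) for a day of year.
# Hypothetical helper; relies on GNU date's relative "+N days" syntax.
doy_to_date() {
    year=$1
    # strip leading zeros (date +%j prints e.g. 031) so the shell
    # does not read the number as octal
    doy=$(printf '%s\n' "$2" | sed 's/^0*//')
    offset=$(( doy - 1 ))
    # -u avoids any daylight-saving surprises when adding days
    date -u --date="$year-01-01 +$offset days" +%F
}
```

For example, doy_to_date 2007 331 prints 2007-11-27, the date used above.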

to get the details on an arbitrary list of files:

locate [file] > filelist.txt
while IFS= read -r file; do ls -l "$file"; done < filelist.txt

to quickly scan through a text file for a word, then use ‘n’ and ‘N’ to jump to the next and previous match:

less -p [word] [file]

to remove all of the blank lines in a text document:

sed -i '/^$/d' [file]
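The pattern /^$/ only matches completely empty lines. A small variant, assuming GNU sed for -i, also drops lines that hold nothing but spaces or tabs; strip_blank_lines is a made-up name.

```shell
# strip_blank_lines FILE...: delete empty and whitespace-only lines in place.
# Hypothetical helper; -i (in-place editing) assumes GNU sed.
strip_blank_lines() {
    sed -i '/^[[:space:]]*$/d' "$@"
}
```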

to add an extension (here .csv) to all files in a directory:

rename 's/$/.csv/' *
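On systems without the perl rename command, a plain loop with mv does roughly the same job. This is a sketch, not a drop-in replacement: add_ext is a made-up name, and unlike the rename one-liner it skips directories and files that already end in the extension.

```shell
# add_ext DIR EXT: append ".EXT" to every regular file in DIR.
# Hypothetical helper; skips directories and files already ending in .EXT.
add_ext() {
    dir=$1
    ext=$2
    for f in "$dir"/*; do
        [ -f "$f" ] || continue       # not a regular file (or the glob matched nothing)
        case $f in
            *".$ext") continue ;;     # already has the extension
        esac
        mv -- "$f" "$f.$ext"
    done
}
```

Run it as: add_ext . csv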

Bioinformatics

to prepend a “>filename” header line (the file name without its extension) to every FASTA file in a directory:

#!/bin/bash
for file in ./*.fasta; do
    foo=${file##*/}    # strip the leading path
    bar=${foo%.*}      # strip the extension
    sed -i "1i \>$bar" "$file"
    echo "$bar"
done
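Running that loop twice on the same directory adds a second header line. A hedged variant that is safe to re-run is sketched here; add_fasta_headers is a made-up name, and it assumes GNU sed and file names without backslashes or newlines.

```shell
# add_fasta_headers DIR: give each DIR/*.fasta a ">name" first line,
# unless the file already starts with ">". Hypothetical helper.
add_fasta_headers() {
    for file in "$1"/*.fasta; do
        [ -f "$file" ] || continue
        base=${file##*/}      # strip the leading path
        name=${base%.*}       # strip the extension
        # leave files alone if their first line is already a header
        if head -n 1 "$file" | grep -q '^>'; then
            continue
        fi
        sed -i "1i >$name" "$file"
    done
}
```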

to download complete genome sequences from JGI Integrated Microbial Genomes (IMG) using a list of IMG taxon ids (input.txt):

#!/bin/sh

BASE="http://img.jgi.doe.gov/cgi-bin/pub/main.cgi?section=TaxonDetail&downloadTaxonFnaFile=1&_noHeader=1&taxon_oid="
for i in $(cat input.txt); do
    echo "$i"
    FILE="$i.fasta"
    URL="${BASE}${i}"
    wget "$URL" -O "$FILE"
done

to find all of the EC numbers in [file], sort, de-replicate, count, and print them in order of decreasing frequency (\d+ so that multi-digit fields such as 2.7.11.1 also match):

grep -o -P 'EC\W*\d+\.\d+\.\d+\.\d+' [file] | sort | uniq -c | sort -rn > output.txt
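To see what such a pipeline produces, here is a sketch that can be fed a few lines on stdin. count_ec is a made-up name. It uses grep -E instead of -P (so it does not need PCRE support), and it adds a second grep that keeps only the number, so that "EC 1.1.1.1" and "EC:1.1.1.1" are counted together; that normalisation is my addition, not part of the one-liner above.

```shell
# count_ec [FILE...]: count EC numbers, most frequent first.
# Hypothetical helper; reads stdin when no file is given.
count_ec() {
    # 1st grep: "EC", any non-word characters, then four dot-separated numbers
    # 2nd grep: keep just the number so different separators count as one
    grep -oE 'EC[^0-9A-Za-z_]*[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' "$@" |
        grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$' |
        sort | uniq -c | sort -rn
}
```

Try it with: printf 'kinase EC 2.7.11.1\nADH (EC:1.1.1.1)\nADH2 EC 1.1.1.1\n' | count_ec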

ARB import filter to read full_name from a FASTA file. Save to $ARBHOME/lib/import/
Format of the FASTA header line should be >[name][tab][full_name]

AUTODETECT      ">*"
        #Global settings:
KEYWIDTH        1

BEGIN   ">??*"

MATCH   ">*"
        SRT "* *=*1:*\t*=*1"
        WRITE "name"

MATCH   ">*"
        SRT "*\t*=*2"
        WRITE "full_name"

SEQUENCEAFTER   "*"
SEQUENCESRT     ""
SEQUENCECOLUMN  0
SEQUENCEEND     ">*"

# DONT_GEN_NAMES
CREATE_ACC_FROM_SEQUENCE

END     "//"

perl script to translate names in tree files or sequence files, given the file to convert and a 2-column, tab-separated translation table. It will probably need to be edited depending on the type of file. Save it as ‘myconvert.pl’, make it executable with ‘chmod +x myconvert.pl’, and run it as ‘./myconvert.pl [treefile] [translationfile]’; the translated file is printed to standard output.

#!/usr/bin/perl
use strict;
use warnings;

my $treefile = $ARGV[0]; # newick-like tree
my $translatefile = $ARGV[1]; #names to translate
my %namehash = ();
open(my $tfh, '<', $translatefile) or die "cannot open $translatefile: $!";
while(<$tfh>) {
    chomp;
    my @array = split(/\t/); #split on tab
    next unless defined $array[1]; #skip lines without a second column
    $array[1] =~ s/[ \/\(\)']/_/g; #replace bad chars with underscore
    $namehash{$array[0]} = $array[1];
}
close $tfh;
open(my $fh, '<', $treefile) or die "cannot open $treefile: $!";
while(<$fh>) {
#   chomp; #uncomment to remove newlines
#   s/^[ \t]*//; #uncomment to remove whitespace at beginning of line
#   s/['"]//g; #uncomment to delete quotation marks
    foreach my $phyname (keys %namehash) {
        s/\Q$phyname\E/$namehash{$phyname}/; #\Q..\E matches the name literally
    }
    print "$_";
}
close $fh;

LaTeX

to generate a clean one-page HTML output of a TeX document

latex2html -split 0 -no_navigation -info 0 -address 0 [file.tex]

to convert the opening quote of each "..." pair into LaTeX opening quotes (``), leaving the closing " in place

sed 's/"\([^"]*"\)/``\1/g' [inputfile] > [outputfile]
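If the closing quote should become '' as well, the replacement can be written inside shell double quotes so the single quotes are easy to type. A sketch only: tex_quotes is a made-up name, and like the command above it assumes each quoted phrase opens and closes on the same line.

```shell
# tex_quotes [FILE...]: rewrite "text" as ``text'' (reads stdin if no file).
# Hypothetical helper; the backslashes protect " and ` from the shell,
# so sed receives:  s/"\([^"]*\)"/``\1''/g
tex_quotes() {
    sed "s/\"\([^\"]*\)\"/\`\`\1''/g" "$@"
}
```

Run it as: tex_quotes [inputfile] > [outputfile]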

to globally comment out/not run figures in LaTeX, put it at the end of the preamble

Engauge [Graph/Plot] Digitizer

Use this excellent program to convert an image of a graph into usable X/Y data points. It expects plots that do NOT have multiple Y values for the same X, so rotate images (e.g. P vs. depth) by 90° before you import them. A plot with multiple colors is the easiest to digitize: in that case just use the ‘discretize’ options and turn off the ‘grid removal’ options. There are tutorials available at the Engauge site on SourceForge.

sudo apt-get install engauge-digitizer