Introducing myself to Julia

I am a huge fan of the statistical programming language R. The initial reason I fell in love with R is mainly its awesome data visualisation library ggplot2. The versatility of this package makes it probably the most powerful tool in my data science/bio-informatics toolbox. Slowly, more and more packages related to ggplot2 came on my radar, and before I knew it, I was addicted to the Tidyverse. (For those unfamiliar with the Tidyverse: it's a set of R packages that share the same philosophy and grammar. It's great for data science.) I did not have a lot of reasons to leave this great universe. It totally made me forget about Python.

As my career progressed, I learned more and more about bio-informatics, which also meant ever-increasing sizes of data that had to be crunched. I must say that R did a pretty decent job of handling all of this (I even used it to decode computer files that were stored on DNA). But since I joined the Bork lab, I find myself running into its limits much faster than before. So I feel that it is time to sharpen some other knives in my toolbox. Dusting off my Python skills would be a logical next step, but then I stumbled onto something newer: Julia.

Julia: A programming language that walks like Python, but runs like C

Obligatory quote about Julia that I found somewhere on the internet

I don’t know what it is exactly that attracts me to this language (maybe it’s FOMO), but I am somewhat eager to test it out. So whenever I now need to solve a task that seems too heavy for R to handle, I try to turn to Julia. And the first such task was:

Extract relevant lines from large file using IDs from other file

So for this task, I would like to extract lines from a large file (32 GB; File B) based on IDs that are stored in another file (File A).

> FileA
ID01
ID02
ID03
...
IDXX
> FileB
ID05  information more_information  even_more_information  102
ID12  information more_information  even_more_information  115
ID89  information more_information  even_more_information  009
...

And this is the first Julia script that I wrote to process these files.

# Read in query file

## Initiate empty vector
queries = Any[]

## Loop through file and populate vector with text
open("fileA") do query_file
        for ln in eachline(query_file)
                push!(queries, ln)
        end
end


# Loop through database file
open("FileB") do file
        ## Loop through lines
        for ln in eachline(file)
                # Extract ID with regex
                ID = match(r"^(\S)+", ln).match

                # search in queries
                if ID in queries
                        println(ln)
                end
        end
end

This was not a great success. I had this script running for more than 30 minutes and it produced nothing at all. For reference: a colleague wrote a Perl script that did the job in 12 minutes. Clearly something was wrong here. So I turned to the pros and asked for their advice on the Julia Discourse forum.

One user immediately saw three red flags, the main one being: Avoid Global Variables (see these performance tips). Based on those tips, I rearranged some code and drafted this Julia script:

# Extract IDs from query file 
function get_ids(file)

        ## Initialise an empty set
        queries = Set{String}()

        ## Loop through the file and populate the set with IDs
        for ln in eachline(file)
                push!(queries, ln)
        end

        return queries

end

# Check ids with db
function compare_ids(queries, db)

        # Loop through db_lines
        for ln in eachline(db)

                # Extract ID with regex
                ID = match(r"^(\S)+", ln).match

                # Search in queries
                if ID in queries
                        println(ln)
                end
        end
end

# Run the actual functions
open("FileA") do query_file

        queries = get_ids(query_file)

        open("FileB") do db_file
                compare_ids(queries, db_file)
        end
end
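Besides moving the work into functions, there is a second change hiding in there that is worth pointing out: the queries now live in a Set{String} instead of an Any[] vector, so every "ID in queries" check is a hash lookup rather than a scan over all stored IDs. A rough sketch of the difference (the IDs below are made up purely for illustration):

# Made-up IDs, only to illustrate vector vs. set lookup
ids_vec = ["ID$(lpad(i, 6, '0'))" for i in 1:1_000_000]   # Vector{String}
ids_set = Set(ids_vec)                                     # Set{String}

"ID999999" in ids_vec   # worst case: scans the whole vector
"ID999999" in ids_set   # a single hash lookup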

The rearranged script is not the most beautiful piece of code I have ever written, but boy, was I happy with its execution time!
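For anyone who wants to put an actual number on a run like this, wrapping the two open calls in Julia’s built-in @time macro is the simplest option (keep in mind that the very first call also includes compilation time):

@time open("FileA") do query_file
        queries = get_ids(query_file)
        open("FileB") do db_file
                compare_ids(queries, db_file)
        end
end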

To be honest, the above “benchmark” might not be super fair, since my colleague spent 5 minutes writing his Perl script while it took me several rounds to optimise my Julia version. Nevertheless, I have successfully taken my first baby steps into Julia and am confident it can make quite an impact in future analyses.

But to achieve that, I will definitely need more practice!

Cheers,
Sander

PS: For completeness, here is the Perl script:

# Store every ID from FileA as a hash key
open(IN, 'FileA');
while (<IN>) {
    chomp;
    $i_need{$_}++;
}

# Print every line of FileB whose first (tab-delimited) field is a wanted ID
open(IN, 'FileB');
while (<IN>) {
    /^(\S+)\t/;
    if ($i_need{$1}) { print; }
}
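And since the Perl version is so compact, here is what a similarly compact Julia sketch of the same logic could look like (untested, and it assumes the tab-separated layout from the example above):

# Compact, untested sketch of the same ID-filtering logic
function filter_by_ids(query_path, db_path)
        queries = Set(eachline(query_path))
        for ln in eachline(db_path)
                first(split(ln, '\t'; limit = 2)) in queries && println(ln)
        end
end

filter_by_ids("FileA", "FileB")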

2 thoughts on “Introducing myself to Julia”

  1. If all the IDs are exactly two digits, you can solve the problem much faster (no coding at all and faster runtime) by just using this standard Unix command:

    grep -f FileA FileB

    If the IDs have varying lengths, the problem becomes a little more complex, because one ID can be a substring of another, but the general idea remains: if the problem is pattern matching in a big text file, just use grep. It has many options and uses the fast (sub-linear) Boyer-Moore algorithm.

    1. Thanks for passing by, Ariel.

      I am familiar with grep, and because of the substring matching it was not an option here. In reality the IDs were also much longer and more complex than what I described in the example, so it would definitely lead to faulty results. In addition, I was surprised that grep also had a long runtime on my first test (it was the first thing I tried); it did not finish quickly at all.

      Finally, I would like to point out, as stated in the beginning of the blog post, that this was not so much about finding the fastest solution, but more about “Introducing myself to Julia”.
