Extracting fasta headers with Julia

Today I needed to extract the headers of multiple fasta files in order to compare them. R is no good for this, so I thought this would be a good job to test my Julia skills again. It turned out to be surprisingly straightforward with Julia’s FASTX package.

The script below does the following:

  • Search a directory for files with the .faa extension
  • Open each file
  • Loop over each sequence and extract its identifier and description
  • Write file name, identifier and description to a semicolon-separated file

# Extract all fasta headers of faa files

# Define packages and functions

## Load FASTX
using FASTX

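## Define function to list the files in a directory whose names end with a given suffix
## (endswith avoids matching names that merely contain the key, e.g. "x.faa.bak")
searchdir(path, key) = filter(x -> endswith(x, key), readdir(path))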

# Define file locations
in_dir = "project/faa_files/"
outfile = open("faa_headers.csv", "w")

# Do the extraction
for file in searchdir(in_dir, ".faa")

        # joinpath builds the path correctly whether or not in_dir ends with "/"
        reader = open(FASTA.Reader, joinpath(in_dir, file))

        for seq in reader
                identifier = FASTA.identifier(seq)
                description = FASTA.description(seq)
                write(outfile, string(file, ";", identifier, ";", description, "\n"))
        end

        close(reader)

end

close(outfile)
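
For reference, a fasta header is the line that starts with ">". By convention, the identifier runs up to the first whitespace and the rest of the line is the description. The sketch below shows that split using only base Julia. Note that split_header is a hypothetical helper written for illustration and is not part of FASTX, which does this parsing for you.

```julia
# Hypothetical helper (not part of FASTX): split a fasta header line
# into its identifier and description.
function split_header(line::AbstractString)
    startswith(line, ">") || throw(ArgumentError("not a fasta header line: $line"))
    # Drop the leading ">" and split once at the first run of whitespace
    parts = split(line[2:end]; limit = 2)
    identifier = isempty(parts) ? "" : String(parts[1])
    # Headers without a description get an empty string
    description = length(parts) > 1 ? String(parts[2]) : ""
    return (identifier = identifier, description = description)
end

h = split_header(">seq1 first protein")
println(h.identifier, ";", h.description)
```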

It worked well, but since I am still learning, I am not sure whether this is the proper way to handle things in Julia.
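
One possible tidy-up, as a sketch rather than the definitive way: wrap the work in a function and use do-blocks, so that both the output file and each reader are closed even if an error is thrown. The function name write_headers and its ext keyword are my own choices for illustration. The FASTX calls are the same ones used in the script above. Depending on the FASTX version, FASTA.description may return only the text after the identifier or the whole header line, so it is worth checking the output.

```julia
using FASTX

# Hypothetical helper: write "file;identifier;description" for every record
# of every file in in_dir whose name ends with ext.
function write_headers(in_dir::AbstractString, out_path::AbstractString; ext = ".faa")
    files = filter(f -> endswith(f, ext), readdir(in_dir))
    open(out_path, "w") do out
        for file in files
            # The do-block closes the reader when the block finishes or fails
            open(FASTA.Reader, joinpath(in_dir, file)) do reader
                for record in reader
                    println(out, file, ";", FASTA.identifier(record), ";",
                            FASTA.description(record))
                end
            end
        end
    end
    return out_path
end
```

With the paths from the script above, the call would be write_headers("project/faa_files/", "faa_headers.csv").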

As always, comments are very welcome!
