Biological research relies more and more on computing for answers; even wet-lab studies need software for data analysis. Learning bioinformatics doesn’t mean you have to become a computer science guru; however, picking up computing and programming skills facilitates your research a lot. In this article, I’ll introduce essential and supplemental skills for a smoother career in bioinformatics.
We list some common and recommended skills. They are ranked roughly from general skills to specific ones. Here is the list:
There are some supplemental skills as well. Learn them only when needed.
- Sed and AWK
- Git, Mercurial or other version control systems
- Make, SCons, or other task managers
- C and/or C++
If the skill list is overwhelming, consider this simple advice: learn basic GNU/Linux (including an editor); then learn Python and R. You don’t need to master all these languages and tools before you begin your research; just learn what is needed. Even a subset of these skills helps a lot. We’ll introduce these skills in the following paragraphs.
GNU/Linux is a free and open-source Unix-like operating system. One reason to use Linux is that much bioinformatic software runs solely on Linux or Unix. However, there are many other advantages to using Linux rather than Windows for your research. Much bioinformatic software is distributed as command-line utilities, and Linux or Unix provides a better interactive experience on the command line. Installing bioinformatic software is easier on Linux or Unix as well. Besides, many programming languages and utilities are available on Linux, including sed, AWK, Perl, Python, R, etc. You don’t need to be a Linux master, but basic familiarity with Linux is helpful.
Even if you are not going to become a professional programmer, a programmer-oriented editor helps a lot in your daily Linux use. Text files are ubiquitous in Unix culture, not limited to the .txt file extension but found in any available text format. You’ll spend a lot of time working with text files in your Linux life, so the importance of a productive editor cannot be overemphasized. Some Linux newcomers may prefer GUI editors, e.g. gedit or Kate. Nevertheless, command-line editors integrate better with the Linux shell, and you don’t need to switch between mouse and keyboard. Among command-line editors on Linux, Vim and Emacs are the most famous. They differ in many aspects; you may try both and choose the one you prefer.
Perl was initially developed as a better tool for text processing and reporting, but evolved into a versatile and eclectic general-purpose programming language. Perl is an agglomeration of Unix culture, absorbing the essence of C, shell script, sed, AWK, etc. You can use Perl for scripting tasks, as a command-line utility (Perl one-liners), and to develop sophisticated applications. Perl is a standard part of Linux and Unix, pre-installed on virtually all Linux and Unix machines. There are also many bioinformatic applications and modules written in Perl. Perl is easy to pick up; you don’t need a computer science degree to master it. Besides, many routine tasks have been implemented as Perl modules, so you don’t need to write everything from scratch. CPAN (Comprehensive Perl Archive Network) provides free and easy access to these modules. BioPerl is a collection of Perl modules for bioinformatic tasks.
R is a free implementation of the S language, a statistical programming language. R is designed as both an interactive environment and a programming language. Even if you don’t know how to program in R, you can still use it interactively for data analysis. Because R is open, anyone can develop new packages for it. Therefore, you can find functions and packages for virtually all statistical tasks. There are also packages for machine learning, image analysis, natural language processing, data visualization, etc. CRAN (Comprehensive R Archive Network) is the main repository for R packages. In addition, Bioconductor collects many bioinformatic R packages not found on CRAN.
Python, like Perl, is a general-purpose programming language. Python is notable for its clear and succinct syntax, nicknamed “executable pseudocode”. Python is less suited to command-line one-liners than Perl. Besides, to accomplish the same task, Python code is usually a little longer than Perl code. However, Python has better support for object-oriented programming than Perl, which makes it suitable for sophisticated application development. Since Perl and Python overlap in many fields, the choice between them is sometimes a matter of personal preference. Python also provides an environment for scientific computing similar to that of R. More and more bioinformatic tools use Python for at least part of their code; TopHat, for example, is driven by a Python script, although its core, like that of the Bowtie aligner it calls, is written in C++. Galaxy is a sophisticated web-based biomedical research platform written in Python. Biopython, like BioPerl, is a collection of Python modules for bioinformatics.
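To give a feel for the “executable pseudocode” style, here is a minimal sketch of two routine sequence tasks in plain Python. The function names and the example sequence are made up for illustration; for real work, a library such as Biopython offers tested equivalents.

```python
# Hypothetical helper names; the example sequence is made up for illustration.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def gc_content(seq):
    """Return the fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


if __name__ == "__main__":
    dna = "ATGCGC"
    print(gc_content(dna))          # 4 of the 6 bases are G or C
    print(reverse_complement(dna))  # GCGCAT
```

The sketch assumes unambiguous bases only; a sequence containing N or other IUPAC codes would raise a KeyError in reverse_complement.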
Databases are the foundation of modern bioinformatic websites; SQL is the lingua franca of relational databases. Databases provide effective abstractions for data storage, retrieval, and modification. There are many relational databases available, either commercial or free. MySQL and PostgreSQL are two popular choices for bioinformatic websites. SQLite is suitable for single-user, offline applications. The SQL dialects of different databases differ slightly in syntax; therefore, check the manuals for the differences.
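As a small illustration of storage and retrieval through SQL, the sketch below uses SQLite via Python’s built-in sqlite3 module. The table layout, gene names, and coordinates are invented for the example and do not come from any real database.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a small, made-up gene table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE genes (name TEXT, chrom TEXT, start INTEGER)")
    conn.executemany(
        "INSERT INTO genes VALUES (?, ?, ?)",
        [("geneB", "chr1", 5000), ("geneA", "chr1", 1200), ("geneC", "chr2", 300)],
    )
    conn.commit()
    return conn


def genes_on_chromosome(conn, chrom):
    """Return the names of genes on one chromosome, sorted by start position."""
    rows = conn.execute(
        "SELECT name FROM genes WHERE chrom = ? ORDER BY start",
        (chrom,),
    )
    return [name for (name,) in rows]


if __name__ == "__main__":
    conn = build_demo_db()
    print(genes_on_chromosome(conn, "chr1"))  # ['geneA', 'geneB']
    conn.close()
```

The `?` placeholders let the database driver handle quoting, which is generally safer than building SQL strings by hand. Other database systems may use a different placeholder style in their Python drivers.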
Before the birth of Perl, Unix had its own tools for text processing; sed and AWK are the representatives. Both sed and AWK are line-oriented text processing tools. In addition, AWK can deal with field-separated data, and AWK itself is a programming language as well. Although AWK is mainly used for one-line programs, you can also use it to write complex applications. The niches of sed and AWK have largely been taken over by Perl. However, they are often faster and simpler and, for simple tasks, their code is shorter than the Perl equivalent.
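To make “line-oriented” and “field-separated” concrete, here is a Python sketch of the kind of filter that AWK expresses in one line: read the input line by line, split each line into fields, test one field, and print another. The function name, the column meanings, and the records are assumptions made up for this example. In AWK itself, roughly the same filter could be written as awk -F'\t' '$3 >= 30 { print $1 }'.

```python
import io


def filter_fields(lines, min_score, sep="\t"):
    """Yield column 1 of each line whose column 3 is at least min_score."""
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        # Skip lines that do not have a third column.
        if len(fields) >= 3 and float(fields[2]) >= min_score:
            yield fields[0]


if __name__ == "__main__":
    # Made-up tab-separated records: id, chromosome, score.
    table = io.StringIO("read1\tchr1\t42\nread2\tchr2\t17\nread3\tchr1\t60\n")
    for read_id in filter_fields(table, min_score=30):
        print(read_id)  # prints read1, then read3
```

The Python version is several times longer than the AWK one-liner, which shows why sed and AWK remain handy for quick jobs on a console.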
If you only need to write one-line programs or simple scripts, a VCS (version control system) is overkill. However, if you want to develop sophisticated applications, a VCS can save you from various errors and unexpected situations, for example by letting you return to an earlier working version. Distributed VCSes like Git or Mercurial have become more popular than traditional centralized VCSes. They provide similar capabilities, so you only need to pick one. You usually need online storage and management for your code and data, so consider GitHub, Bitbucket, or other repository hosting services.
If you find that you are doing repetitive tasks on a console, try make to save time. make is not merely a software building tool; you can use make for virtually any task on a console. To use make, you need to write a Makefile. You may also use SCons, a Python-based alternative to make. The advantage of SCons is that you can write task scripts in familiar Python syntax rather than in a new language.
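The reason such task managers save time is that they rerun a step only when its output is missing or older than its inputs. The sketch below illustrates that rule in plain Python; it is not how make or SCons is implemented, and the function and file names are hypothetical.

```python
import os
import tempfile


def needs_rebuild(target, dependencies):
    """Return True if target is missing or older than any dependency."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_time for dep in dependencies)


def run_task(target, dependencies, action):
    """Run action() only when the target is out of date; report whether it ran."""
    if needs_rebuild(target, dependencies):
        action()
        return True
    return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        source = os.path.join(workdir, "reads.txt")   # hypothetical input file
        target = os.path.join(workdir, "counts.txt")  # hypothetical output file
        with open(source, "w") as fh:
            fh.write("ACGT\nACGA\n")

        def count_lines():
            # The "recipe": write the number of input lines to the target.
            with open(source) as src, open(target, "w") as out:
                out.write(str(sum(1 for _ in src)))

        print(run_task(target, [source], count_lines))  # True: target was missing
        print(run_task(target, [source], count_lines))  # False: target is up to date
```

In a Makefile, the same relationship is declared as a target, its dependencies, and a recipe, and make performs the timestamp comparison for you.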
MATLAB is a high-level language and interactive environment for scientific and technical computing. It is used in many fields, including numeric computation, linear algebra, data analysis, Fourier analysis, etc. Different features are provided in toolboxes that are sold separately. MATLAB also supports bioinformatics and systems biology through related toolboxes.
C and C++ are not used routinely in data analysis tasks but for system and application development. When developing bioinformatic applications, using high-level languages like Perl, Python, or R is easier and quicker. But you can rewrite the performance-critical parts in C or C++ and call them from your application. Compared with these high-level languages, C and C++ are more difficult to learn. Developing applications in C/C++ also takes longer. Therefore, few bioinformatic applications are developed solely in C/C++.
Java is a general-purpose programming language developed by Sun Microsystems, which was later acquired by Oracle. Java provides many improvements over C++ and is easier to learn than C++. Like C/C++, Java is used not for data analysis but for application development. Unlike C/C++, Java code is compiled not into machine code but into intermediate bytecode, which is then executed by the Java virtual machine. The Java virtual machine and development tools are available for several platforms, including Windows, Linux, Mac OS X, and Solaris. Most Java code can run on different platforms without modification. Some examples of bioinformatic software written in Java are FastQC, GSEA, and BioJava.