This article introduces the basics of Linux, common command usage, and software installation, along with whole-genome sequencing technology, to help you get started with bioinformatics.

1. Learning Linux

(A) Directory structure and path

Root directory: The root directory is the top-level directory of a file system. It is represented by "/", and all other directories are subdirectories of the root directory.

Home directory: A home directory, represented by "~/", is the directory that serves as the repository for a user's personal files, directories, and programs. It is also the directory that a user is first in after logging into the system.

Absolute path: An absolute path, also known as the full path, refers to the absolute location of a directory or file, and is the full path starting from the root directory, such as "/home/bio/". The absolute path of the current directory can be obtained by the "pwd" command.

Relative path: A relative path is a way to specify the location of a directory relative to another directory. The relative path does not need to start from the root directory.

Current directory: The current directory is represented by "./".

Parent directory: "../" represents the parent directory, "../../" represents the parent directory of the parent directory, and so on.
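
The path notations above can be tried out safely in a scratch directory. The following is a minimal sketch; the directory names "project" and "data" are hypothetical and are created under a temporary location so that nothing in the real home directory is touched.

```shell
# Create a throwaway directory tree (names are hypothetical).
base=$(mktemp -d)
mkdir -p "$base/project/data"

cd "$base/project/data"
here=$(pwd -P)        # absolute path of the current directory

cd ../                # relative path: move up to the parent directory
parent=$(pwd -P)

cd ./data             # relative path starting from the current directory
back=$(pwd -P)

echo "start:  $here"
echo "parent: $parent"
echo "back:   $back"
```

After the last step, "start" and "back" print the same absolute path, because "../" followed by "./data" returns to the starting directory.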

(B) Nomenclature

(i) Name files and directories using English letters, numbers, and underscores;

(ii) Avoid using spaces in the names of files and directories;

(iii) Names are case sensitive.

(C) Terminal tools

A terminal is a tool for running Linux commands, similar to the command-line tool of Windows.

When working on a remote Linux server, users can use third-party terminal tools such as PuTTY.

File transfer between the local computer and the remote server can be done with an FTP client such as FileZilla. Enter the server's IP address, username, password, and port to sign in to the server. If the server uses the FTP protocol, the default port is 21. If the SFTP protocol is used, the default port is 22.

(D) Commonly used commands

pwd: Get the absolute path of the current location

$pwd

mkdir: Create a directory

$mkdir tools

ls: View the contents of the current directory

$ls

View all directories and files (including hidden contents)

$ls -a

View the contents of the root directory

$ls /

View the contents of the home directory

$ls ~/

cd: Switch path

Go to the "tools" directory

$cd tools

vim: Create/edit a document

All of the following operations must be performed in English input mode. First, create a new document named "example.txt" and enter some words.

$vim example.txt

Text cannot be entered until the "i" key is pressed to switch to insert mode. When "-- INSERT --" appears in the lower-left corner, users can enter text.

When the input is complete, press the "Esc" key to leave insert mode. The "-- INSERT --" indicator then disappears.

Press "shift + ;" to switch to command-line mode; a ":" appears in the lower-left corner.

Enter "wq!" and press Enter to save your changes and exit.

Check if the created file is in the directory.

$ls

cp: copy directories or files

Copy the created "example.txt" document to the parent directory.

$cp example.txt ../

Check if the document was copied successfully.

$ls ../

rm: delete the directories or files

Delete the "example.txt" document in the tools directory.

$rm example.txt

Check if the document has been deleted successfully.

$ls

mv: move or rename

Move the "example.txt" document in the parent directory to the current directory.

$mv ../example.txt ./

Check if the document was moved successfully.

$ls ../
$ls

Rename the "example.txt" document to "examp2.txt".

$mv example.txt examp2.txt

View the results of the rename.

$ls
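
The copy, delete, move, and rename steps above can be replayed end to end as a single script. This is only a sketch: it works in a temporary scratch directory (an assumption, so that no real files are affected) and uses "echo" as a stand-in for creating the file with vim.

```shell
work=$(mktemp -d)              # scratch directory (hypothetical location)
mkdir "$work/tools"
cd "$work/tools"

echo "hello" > example.txt     # stand-in for the file created with vim
cp example.txt ../             # copy to the parent directory
rm example.txt                 # delete the copy in tools
mv ../example.txt ./           # move the file back from the parent directory
mv example.txt examp2.txt      # rename

ls                             # lists examp2.txt
ls ../                         # the parent now holds only the tools directory
```

Running "ls" after each step, as in the walkthrough, is a simple way to confirm that every command did what was expected.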

wget: Download

Use the wget tool to download the genome assembly software “ABySS” to the tools directory.

$wget http://www.bcgsc.ca/platform/bioinfo/software/abyss/releases/2.1.5/abyss-2.1.5.tar.gz

tar: compression/decompression

A file in "tar.gz" format can be decompressed with "tar zxvf". Decompress the "abyss-2.1.5.tar.gz" file that you just downloaded.

$tar zxvf abyss-2.1.5.tar.gz
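
To see what the individual letters in "tar zxvf" do without downloading anything, a small archive can be built and unpacked locally. The sketch below assumes a hypothetical directory named "demo" inside a temporary scratch location.

```shell
work=$(mktemp -d)
cd "$work"
mkdir demo
echo "sample" > demo/readme.txt

tar zcvf demo.tar.gz demo      # c = create, z = gzip, v = verbose, f = archive file name
tar ztf demo.tar.gz            # t = list the contents without extracting
rm -r demo
tar zxvf demo.tar.gz           # x = extract
cat demo/readme.txt
```

Listing an archive with "t" before extracting it is a useful habit, because it shows which directory the files will be unpacked into.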

top: View system processes

$top

Press the "q" key to exit. Ubuntu also offers a more intuitive tool, "htop", for viewing system processes.

$htop

(E) Environment Variables

It is often necessary to set environment variables when installing software. An environment variable such as PATH tells the system where the software is installed. The files that store environment variables are hidden in the user's home directory and can be viewed with the "ls -a" command.

$ls -a ~/

".bashrc" and ".profile" are both environment variable configuration files; usually, only ".bashrc" needs to be edited.

(F) Software Installation on Linux

(i) Source code compilation and installation

Source code installation is suitable for all Linux distributions and macOS. Taking the "ABySS" genome assembly software as an example, compiling and installing from source requires three steps: configuration (./configure), compilation (make), and installation (sudo make install). First, enter the "ABySS" software directory and list its files to find the configuration script "configure", then configure the software according to the instructions in "README.md".

$cd abyss-2.1.5/
$ls

"./configure" means running configure for pre-installation configuration.

$./configure

Compile the software

$make

To install, you need the "sudo" command to provide write access to the system directory.

$sudo make install

Note: The above steps only show the general installation method, but the "ABySS" software relies on some other software. Users need to install the dependency packages first and then install "ABySS"; otherwise, the installation will fail.
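
Before running "./configure", it can save time to check that the basic build tools are available. The helper below is a minimal sketch; the function name "missing_tools" is made up for this example, and the tool list (gcc, make) is only a starting assumption that should be extended with whatever "README.md" lists as dependencies.

```shell
# Report which of the named commands cannot be found on PATH.
missing_tools() {
    missing=""
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            missing="$missing $tool"
        fi
    done
    echo "${missing# }"        # strip the leading space
}

# gcc and make are assumed here; add the dependencies named in README.md.
result=$(missing_tools gcc make)
if [ -n "$result" ]; then
    echo "Install these first: $result"
else
    echo "Build tools found"
fi
```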

(ii) Install software through the package management tool

Different Linux distributions have their own package managers. Most Linux distributions in current use are based on "RedHat" or "Debian". The package manager of the RedHat series is "yum"; to use it, enter "sudo yum install -y software-name" in the terminal. The package manager of the Debian series is "apt-get"; to use it, enter "sudo apt-get install software-name" in the terminal.

Example: To install the ABySS software in Ubuntu via apt-get, enter the command and password, then enter "Y" as prompted and press Enter to install automatically.

$sudo apt-get install abyss

(iii) Install the software by editing environment variables

The prokaryotic gene prediction software "Prodigal" is used as an example. First, find the Prodigal source code on GitHub, click on "Clone or download", and copy the git address.

Enter the tools directory in the terminal and run the clone command to copy the project to the local computer. The command format is "git clone link".

$git clone https://github.com/hyattpd/Prodigal.git

Go to the "Prodigal" directory after the clone is complete.

$cd Prodigal

Compile the software

$make

If "gcc" is not installed, an error message will show that the gcc command cannot be found, so users need to install gcc first. After entering the command, enter the password as prompted and wait until the installation is complete.

$sudo apt-get install gcc

Recompile prodigal

$make
$ls

When the compilation is complete, the executable is obtained, but the system cannot find the path of prodigal, so users need to add its path to the environment variable. Open the environment variable configuration file ".bashrc" via vim.

$vim ~/.bashrc

Add the configuration statement "export PATH=$PATH:$HOME/tools/Prodigal" at the end of the document. "$HOME" represents the user home directory, and "$HOME/tools/Prodigal" represents the directory where the prodigal executable is located.

Save and exit after editing is complete. Then execute the "source ~/.bashrc" command so that the current shell reloads ".bashrc" and the change takes effect.

$source ~/.bashrc

Check if the configuration is successful.

$prodigal -h

To add other software to the environment variable, just write their path behind the previous one. The paths of each software are separated by ":", and there must be no spaces.
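
Because PATH is a colon-separated list, it can be inspected and extended with a few standard tools. The sketch below first prints the current entries one per line, then appends a directory only if it is not already listed, which avoids duplicate entries when ".bashrc" is sourced repeatedly. The function name "add_to_path" is made up for this example; the directory is the Prodigal location used above.

```shell
# Show each directory currently on PATH, one per line (searched in this order).
echo "$PATH" | tr ':' '\n'

# Append a directory to PATH only if it is not already listed.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                   # already present, do nothing
        *) PATH="$PATH:$1" ;;
    esac
}

add_to_path "$HOME/tools/Prodigal"
add_to_path "$HOME/tools/Prodigal"     # second call leaves PATH unchanged
export PATH

# Count how many times the directory appears (exact, whole-line match).
matches=$(echo "$PATH" | tr ':' '\n' | grep -c -x -F "$HOME/tools/Prodigal")
echo "Prodigal directory listed $matches time(s)"
```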

A soft link (symbolic link) is similar to a shortcut in Windows. Users can place a soft link to the executable program in a directory that is already on the system's default PATH, such as "/usr/bin/" or "/usr/local/bin". Still taking the newly compiled prodigal software as an example, the command for creating a soft link is "sudo ln -s /home/bio/tools/Prodigal/prodigal /usr/local/bin/prodigal". Enter the password as prompted to complete the creation.

$sudo ln -s /home/bio/tools/Prodigal/prodigal /usr/local/bin/prodigal

Check if the soft link is created successfully by using the "whereis" command.

$whereis prodigal

Notice: Users must enter the absolute path when creating a soft link; otherwise, the error message "Too many levels of symbolic links" will appear.
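
The reason for the absolute-path rule can be reproduced without "sudo" in a scratch directory. In this sketch, "tool" is a hypothetical stand-in script for the compiled prodigal binary, and "bin" and "bin_bad" stand in for a directory such as "/usr/local/bin". A relative target is stored as written and resolved relative to the directory that holds the link, so a link named "tool" whose target is "tool" ends up pointing at itself.

```shell
work=$(mktemp -d)
mkdir "$work/tools" "$work/bin" "$work/bin_bad"

# A stand-in executable (hypothetical; the real target would be the prodigal binary).
printf '#!/bin/sh\necho "tool ran"\n' > "$work/tools/tool"
chmod +x "$work/tools/tool"

# Absolute target: the link resolves correctly from anywhere.
ln -s "$work/tools/tool" "$work/bin/tool"
good_output=$("$work/bin/tool")
echo "$good_output"

# Relative target: "tool" is looked up inside bin_bad, so the link points at itself.
cd "$work/tools"
ln -s tool "$work/bin_bad/tool"
if [ -L "$work/bin_bad/tool" ] && [ ! -e "$work/bin_bad/tool" ]; then
    echo "relative link is broken, target stored as: $(readlink "$work/bin_bad/tool")"
fi
```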

(iv) Install the software via Anaconda Package Manager

Conda, the package manager that ships with Anaconda, is a free, platform-agnostic, easy-to-install package and environment manager. Bioconda is a channel for the conda package manager specializing in bioinformatics software. When installing software via conda, all the dependencies are installed with one command, which saves time and reduces the installation difficulty. Bioconda currently has more than 600 contributors and 500 members, and most bioinformatics software is included. Users can go to the official website to search for the software they need.

Installation of conda

Here, Miniconda will be installed. Go to the official website and select the installation file suitable for your system and Python version.

$wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

Start the installation

$bash Miniconda3-latest-Linux-x86_64.sh

Press Enter when prompted to view the license, enter "yes" and press "Enter" to continue. Press "Enter" to confirm the installation location. Miniconda is installed in the miniconda3 directory under the home directory. Type "yes" and press "Enter" to initialize. Finally, run the "source ~/.bashrc" command.

Set up bioconda channel

Add the channels by entering the following three commands in the terminal:

$conda config --add channels defaults
$conda config --add channels bioconda
$conda config --add channels conda-forge

Bioconda is now configured, and bioinformatics software can be installed via conda. Next, the mapping software "bwa" will be installed via conda.

$conda install bwa

Enter "y" as prompted to complete the installation.

(G) Install bioinformatics software on macOS

(i) Source code compilation and installation

The source code installation method on macOS is the same as on Linux.

The configuration method is the same as on Linux.

(iii) Install the software by adding environment variables

The macOS environment variable configuration method is the same as that of Linux, but the configuration file is ".bash_profile" in the home directory. Run the following command to edit it.

$vim ~/.bash_profile

After the editing is completed and saved, users need to run the source command.

$source ~/.bash_profile

(iv) Install software through the package management tool

A widely used package manager for macOS is Homebrew, and it can be installed with the following command.

$/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Install the mapping software "bwa" via Homebrew.

$brew install bwa

(v) Configure Anaconda on macOS

Installation of Miniconda3

$wget https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
$sh Miniconda3-latest-MacOSX-x86_64.sh
$source ~/.bash_profile

Add channels of Bioconda

$conda config --add channels defaults
$conda config --add channels bioconda
$conda config --add channels conda-forge

Installation of the software bwa

$conda install bwa

2. Sequencing technology

(A) "Next-generation" sequencing

The core idea of next-generation sequencing (NGS) technology is “Sequencing by Synthesis”, which determines the DNA sequence by detecting each newly incorporated dNTP. The most widely used platforms are Illumina's. The advantages of NGS include high throughput, scalability, low error rate, low cost, and speed.

Terms in Illumina sequencing:

  • Read length: The number of base pairs (bp) sequenced from a DNA fragment.

  • Insert size: The length of sequence between paired-end adapters.

  • Junction: The unsequenced part in the middle of the insert.

  • Flowcell: The location where the sequencing reaction occurs. One flowcell contains 8 lanes.

  • Lane: One of the eight channels on a flowcell. Samples can be loaded on separate lanes for simultaneous analysis on the Illumina sequencing system.

  • Raw data: Reads with adapters, linkers, low-quality sequences, and short sequences.

  • Clean data: Raw data from which adapters, linkers, low-quality reads, short reads, and reads produced by ribosomal RNA and ncRNA have been removed.

  • Data size: (Read length) x (Reads number).

(B) Third-generation sequencing

Third-generation sequencing, also known as single-molecule sequencing, does not require a PCR amplification step and can produce ultra-long reads, which can span regions with high GC content and highly repetitive regions. Currently used sequencing platforms include Pacific Biosciences (PacBio) and Oxford Nanopore.

PacBio uses the SMRT Cell as a carrier for sequencing reactions. PacBio also uses the “Sequencing by Synthesis” method: sequencing reactions are carried out in nanoscale wells called zero-mode waveguides (ZMWs), in each of which a DNA polymerase and a DNA template are immobilized. The dNTP fluorescence signal is detected during the extension reaction to determine the base sequence.

The nanopore sequencing technology developed by Oxford Nanopore Technologies is truly real-time sequencing; it determines the bases from changes in an electrical signal.

Table 1. Characteristics, strengths and weaknesses of commonly used sequencing platforms (Besser et al. 2018)

| Platform | Instrument | Throughput (Gb) | Read length (bp) | Strength | Weakness |
|---|---|---|---|---|---|
| Sanger sequencing | ABI 3500/3730 | 0.0003 | Up to 1 kb | Read accuracy and length | Cost and throughput |
| Illumina | MiniSeq | 1.7–7.5 | 1×75 to ×150 | Low initial investment | Run and read length |
| Illumina | MiSeq | 0.3–15 | 1×36 to 2×300 | Read length, scalability | Run length |
| Illumina | NextSeq | 10–120 | 1×75 to 2×150 | Throughput | Run and read length |
| Illumina | HiSeq (2500) | 10–1000 | ×50 to ×250 | Read accuracy, throughput | High initial investment, run |
| Illumina | NovaSeq 5000/6000 | 2000–6000 | 2×50 to ×150 | Read accuracy, throughput | High initial investment, run |
| IonTorrent | PGM | 0.08–2 | Up to 400 | Read length, speed | Throughput, homopolymers |
| IonTorrent | S5 | 0.6–15 | Up to 400 | Read length, speed | Homopolymers |
| IonTorrent | Proton | 10–15 | Up to 200 | Speed, throughput | Homopolymers |
| Pacific Biosciences | PacBio RSII | 0.5–1 | Up to 60 kb (average 10 kb, N50 20 kb) | Read length, speed | High error rate and initial |
| Pacific Biosciences | Sequel | 5–10 | Up to 60 kb (average 10 kb, N50 20 kb) | Read length, speed | High error rate |
| Oxford Nanopore | MinION | 0.1–1 | Up to 100 kb | Read length, portability | High error rate, run length |

(C) Common sequence format

Fastq

A FASTQ file is a text file that contains the sequence data from the clusters that pass filter on a flow cell. Read data are usually stored in FASTQ format, and each read occupies four lines: the first line begins with "@" and describes information about the sequencing run, the second line contains the nucleotide sequence, the third line begins with "+" and generally carries no further information, and the fourth line gives the sequencing quality of each base in the second line.

Table 2. Descriptions of the first line of the fastq file

| String | Description |
|---|---|
| @ST-E00310 | The unique instrument name |
| 147 | The run id |
| HVT25CCXX | The flowcell id |
| 3 | Flowcell lane |
| 1011 | The tile number within the flowcell lane |
| 13382 | ‘x’-coordinate of the cluster within the tile |
| 1819 | ‘y’-coordinate of the cluster within the tile |
| 1 | The number of a pair, 1 or 2 (paired-end or mate-pair reads only) |
| N | Y if the read fails filter (read is bad), N otherwise |
| 0 | 0 when none of the control bits are on, otherwise it is an even number |
| TGAAGACA | Index sequence |
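
The four-line structure and the header fields can be explored with awk. In the sketch below, the header line is assembled from the strings in Table 2 in the usual colon-separated Illumina layout, while the sequence and quality lines are made up purely for illustration.

```shell
fq=$(mktemp)
# One four-line record: header built from Table 2; sequence and quality are hypothetical.
cat > "$fq" <<'EOF'
@ST-E00310:147:HVT25CCXX:3:1011:13382:1819 1:N:0:TGAAGACA
GATTTGGGGTTCAAAGCAGT
+
AAFFFJJJJJJJJJJJJJJJ
EOF

# Number of reads = number of lines / 4.
reads=$(awk 'END { print NR / 4 }' "$fq")

# Length of the sequence on the second line of the first record.
read_len=$(awk 'NR == 2 { print length($0) }' "$fq")

# Split the header on ":" and " " to pull out individual fields.
lane=$(awk -F'[: ]' 'NR == 1 { print $4 }' "$fq")
index_seq=$(awk -F'[: ]' 'NR == 1 { print $11 }' "$fq")

echo "reads: $reads, read length: $read_len, lane: $lane, index: $index_seq"
```

The same line-count trick works on real files, provided every record occupies exactly four lines.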

Fasta

The FASTA format is a text-based format for representing nucleotide sequences or amino acid sequences. Each sequence consists of two parts: the first part is a header line that starts with ">" and contains the ID of the sequence, and the second part contains the nucleotide or amino acid sequence.
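
Because every record starts with ">", simple text tools are enough to summarize a FASTA file. The following sketch counts the sequences and the total number of bases; the two records are hypothetical and exist only to make the example self-contained.

```shell
fa=$(mktemp)
# Two made-up records; a sequence may span several lines.
cat > "$fa" <<'EOF'
>seq1 hypothetical example
ATGCGTAC
GGCTA
>seq2 hypothetical example
TTGACA
EOF

# Each header line starts with ">", so counting them counts the sequences.
n_seqs=$(grep -c '^>' "$fa")

# Sum the lengths of all non-header lines.
total_bases=$(awk '!/^>/ { total += length($0) } END { print total }' "$fa")

echo "sequences: $n_seqs, total bases: $total_bases"
```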

Genbank

GenBank format stores sequence and its annotation together.

GFF3

GFF3 (Generic Feature Format version 3) describes features for biological sequences. Each line consists of 9 tab-delimited columns.
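
The tab-delimited layout makes GFF3 easy to query with awk. In this sketch, the three feature lines are hypothetical (the names "contig1" and "example" and all coordinates are made up); the third column holds the feature type.

```shell
gff=$(mktemp)
# Hypothetical records: seqid, source, type, start, end, score, strand, phase, attributes.
printf '##gff-version 3\n' > "$gff"
printf 'contig1\texample\tgene\t100\t400\t.\t+\t.\tID=gene1\n' >> "$gff"
printf 'contig1\texample\tCDS\t100\t400\t.\t+\t0\tID=cds1;Parent=gene1\n' >> "$gff"
printf 'contig1\texample\tgene\t900\t1500\t.\t-\t.\tID=gene2\n' >> "$gff"

# Number of columns in the first feature line (comment lines start with "#").
n_cols=$(awk -F'\t' '!/^#/ { print NF; exit }' "$gff")

# Count the features whose type (column 3) is "gene".
n_genes=$(awk -F'\t' '!/^#/ && $3 == "gene" { n++ } END { print n + 0 }' "$gff")

echo "columns: $n_cols, genes: $n_genes"
```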

(D) Basic concepts of genome assembly

Sequencing depth: Sequencing depth refers to the ratio of the total number of bases (read length x reads number) obtained by sequencing to the haploid genome length.
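
As a worked example of this ratio, the sketch below computes the data size and depth with shell arithmetic. All three input values are hypothetical and chosen only to give round numbers.

```shell
read_length=150          # bp per read (hypothetical)
read_count=20000000      # total number of reads (hypothetical run)
genome_size=5000000      # haploid genome length in bp (hypothetical genome)

data_size=$((read_length * read_count))    # total number of sequenced bases
depth=$((data_size / genome_size))         # integer division

echo "Data size: $data_size bp"
echo "Depth: ${depth}x"
```

With these values the data size is 3,000,000,000 bp (3 Gb), which corresponds to a depth of 600x for a 5 Mb genome.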

Coverage: Coverage refers to the proportion of the whole genome that is covered by the sequences obtained by sequencing. Due to the presence of complex regions such as high GC and repetitive sequences in the genome, sequences obtained by sequencing often fail to cover all regions of the genome. For example, a coverage of 96% indicates that 4% of the genome was not sequenced.

Read, Contig, and Scaffold: The sequences obtained by sequencing are called reads. A sequence assembled from reads according to their overlaps is known as a contig. Contigs are arranged in order according to paired-end information to obtain a scaffold.

N50: N50 is a measure of the quality of assembled genomes that are fragmented into contigs of different lengths. When the contigs are sorted from longest to shortest, the N50 is the length of the contig at which the cumulative length first reaches 50% of the total assembly length.
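
The N50 calculation can be sketched with sort and awk. The function name "n50" and the six contig lengths are made up for this example, and the calculation uses the total assembly length (the sum of all contig lengths) as the reference.

```shell
# Read contig lengths (one per line) and print the N50.
n50() {
    sort -rn | awk '
        { len[NR] = $1; total += $1 }
        END {
            running = 0
            for (i = 1; i <= NR; i++) {
                running += len[i]
                # Stop at the first contig where the running sum reaches half the total.
                if (running * 2 >= total) { print len[i]; exit }
            }
        }'
}

# Hypothetical contig lengths: total 290, half 145.
result=$(printf '%s\n' 80 70 50 40 30 20 | n50)
echo "N50: $result"
```

Here the two longest contigs (80 + 70 = 150) are the first to reach half of the total length of 290, so the N50 is 70.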

References

Besser J, Carleton HA, Gerner-Smidt P, Lindsey RL, Trees E. Next-generation sequencing technologies and their application to the study and control of bacterial infections. Clinical Microbiology and Infection, 2018, 24: 335-341

Copyright © 2008-2019 State Key Laboratory of Agricultural Microbiology, Wuhan, China
Email:liaochenlanruo@webmail.hzau.edu.cn