This article introduces the basics of Linux, the usage of common commands, and software installation. It also introduces whole-genome sequencing technology to help you get started with bioinformatics.
Root directory: The root directory is the top-level directory of a file system and is represented by "/". All other directories are located beneath the root directory.
Home directory: A home directory, represented by "~/", is the directory that serves as the repository for a user's personal files, directories, and programs. It is also the directory that a user is first in after logging into the system.
Absolute path: An absolute path, also known as the full path, refers to the absolute location of a directory or file, and is the full path starting from the root directory, such as "/home/bio/". The absolute path of the current directory can be obtained by the "pwd" command.
Relative path: A relative path is a way to specify the location of a directory relative to another directory. The relative path does not need to start from the root directory.
Current directory: The current directory is represented by "./".
Parent directory: "../" represents the parent directory, "../../" represents the parent of the parent directory, and so on.
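The path notations above can be tried out safely in a temporary directory. The following is a minimal sketch; the directory names "project" and "data" are made up for illustration.

```shell
# Build a small directory tree in a temporary location
# (nothing in the home directory is touched).
workdir=$(mktemp -d)
mkdir -p "$workdir/project/data"
cd "$workdir/project/data" || exit 1

# Absolute path of the current directory, starting from "/".
current_abs=$(pwd)
echo "current directory: $current_abs"

# "../" is the parent directory; resolve it to an absolute path.
parent_abs=$(cd ../ && pwd)
echo "parent directory:  $parent_abs"

# "../../" is the parent of the parent directory.
grandparent_abs=$(cd ../../ && pwd)
echo "two levels up:     $grandparent_abs"
```

The relative paths "../" and "../../" only have a meaning with respect to the current directory, whereas the absolute paths printed here are valid from anywhere in the file system.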
(i) File and directory names should use only English letters, numbers, and underscores;
(ii) Avoid using spaces in the names of files and directories;
(iii) Names are case sensitive.
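The naming rules above can be checked with a small shell function. This is only a sketch: "is_safe_name" is a hypothetical helper, and accepting a dot (so that extensions such as ".txt" pass) is an assumption that goes slightly beyond the rules as written.

```shell
# is_safe_name (hypothetical helper): succeeds when the name contains only
# English letters, digits, underscores, and dots; fails otherwise.
is_safe_name() {
    case "$1" in
        ""|*[!A-Za-z0-9_.]*) return 1 ;;   # empty, or contains another character
        *) return 0 ;;
    esac
}

for name in "example.txt" "my file.txt" "Sample_01"; do
    if is_safe_name "$name"; then
        echo "$name: ok"
    else
        echo "$name: avoid"
    fi
done
```

Names containing spaces are flagged because a space separates arguments on the command line, so such names must be quoted every time they are used.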
A terminal is a tool for running Linux commands, similar to the command-line tool of Windows.
When working on a remote Linux server, users can connect with a third-party terminal tool such as PuTTY.
Files can be transferred between the local computer and the remote server with an FTP client such as FileZilla. Enter the server's IP address, username, password, and port to sign in to the server. If the server uses the FTP protocol, the port is usually 21; if the SFTP protocol is used, the port is usually 22.
pwd: Get the absolute path of the current location
$pwd
mkdir: Create a directory
$mkdir tools
ls: View the contents of the current directory
$ls
View all directories and files (including hidden contents)
$ls -a
View the contents of the root directory
$ls /
View the contents of the home directory
$ls ~/
cd: Change directory
Go to the "tools" directory
$cd tools
vim: Create/edit a document
All of the following operations must be performed in English input mode. First, create a new document named "example.txt" and enter some words.
$vim example.txt
Text cannot be entered until the "i" key is pressed to switch to insert mode. When "-- INSERT --" appears in the lower-left corner, text can be entered.
When the input is complete, press the "Esc" key to leave insert mode. The "-- INSERT --" indicator then disappears.
Press "Shift + ;" (that is, type ":") to switch to command-line mode; a ":" appears in the lower-left corner.
Enter "wq!" and press Enter to save the changes and exit.
Check if the created file is in the directory.
$ls
cp: Copy directories or files (use "cp -r" for directories)
Copy the created "example.txt" document to the parent directory.
$cp example.txt ../
Check if the document was copied successfully.
$ls ../
rm: Delete directories or files (use "rm -r" for directories)
Delete the "example.txt" document in the tools directory.
$rm example.txt
Check if the document has been deleted successfully.
$ls
mv: Move or rename
Move the "example.txt" document in the parent directory to the current directory.
$mv ../example.txt ./
Check if the document was moved successfully.
$ls ../
$ls
Rename the "example.txt" document to "examp2.txt".
$mv example.txt examp2.txt
View the results of the rename.
$ls
wget: Download
Use the wget tool to download the genome assembly software "ABySS" to the tools directory.
$wget http://www.bcgsc.ca/platform/bioinfo/software/abyss/releases/2.1.5/abyss-2.1.5.tar.gz
tar: Compression/decompression
A file in "tar.gz" format can be decompressed with "tar zxvf". Decompress the "abyss-2.1.5.tar.gz" file that you just downloaded.
$tar zxvf abyss-2.1.5.tar.gz
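To see what the individual tar options do without downloading anything, the round trip can be tried on a small made-up directory. This is a sketch only; "demo_pkg" is a hypothetical name.

```shell
# Create a small directory in a temporary location to act as the "package".
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir demo_pkg
echo "hello" > demo_pkg/README.txt

# c = create an archive, z = compress with gzip, f = archive file name
tar czf demo_pkg.tar.gz demo_pkg

# t = list the archive contents without extracting them
tar tzf demo_pkg.tar.gz

# Remove the original directory, then extract it again:
# x = extract, v = print each file name while extracting
rm -r demo_pkg
tar zxvf demo_pkg.tar.gz
cat demo_pkg/README.txt
```

Listing an archive with "tar tzf" before extracting is a convenient way to check which directory it will create.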
top: View system processes
$top
Press the "q" key to exit. "htop" is a more intuitive tool for viewing system processes; on Ubuntu it may need to be installed first (for example, with "sudo apt-get install htop").
$htop
Environment variables often need to be set when installing software. The PATH environment variable tells the system which directories to search for installed programs. The files that store environment variable settings are hidden in the user's home directory and can be viewed with the "ls -a" command.
$ls -a ~/
".bashrc" and ".profile" are both environment variable configuration files; usually, only ".bashrc" needs to be edited.
Source code installation is suitable for all Linux distributions and macOS. The "ABySS" genome assembly software is used here as an example to demonstrate compilation and installation from source, which takes three steps: configuration (./configure), compilation (make), and installation (sudo make install). First, enter the "ABySS" software directory and view its files. Find the configuration script "configure", and configure the software according to the instructions in "README.md".
$cd abyss-2.1.5/
$ls
"./configure" means running configure for pre-installation configuration.
$./configure
Compile the software
$make
Installation requires the "sudo" command to obtain write access to the system directories.
$sudo make install
Note: The steps above show only the general installation method. "ABySS" depends on other software, so users need to install the dependency packages first and then install "ABySS"; otherwise, the installation will fail.
Different Linux distributions have their own package managers. The most widely used Linux distributions are based mainly on "RedHat" or "Debian". The package manager of the RedHat series is "yum": enter "sudo yum install -y software-name" in the terminal. The package manager of the Debian series is "apt-get": enter "sudo apt-get install software-name" in the terminal.
Example: To install ABySS on Ubuntu via apt-get, enter the command and your password, then enter "Y" when prompted and press Enter to install automatically.
$sudo apt-get install abyss
The prokaryotic gene prediction software "Prodigal" is used here as an example. First, find the Prodigal source code on GitHub, click "Clone or download", and copy the git address.
Enter the tools directory in the terminal and run the clone command to copy the project to the local computer. The command format is "git clone link".
$git clone https://github.com/hyattpd/Prodigal.git
Go to the "Prodigal" directory after the clone is complete.
$cd Prodigal
Compile the software
$make
If "gcc" is not installed, an error message will report that the gcc command cannot be found, so users need to install gcc first. After entering the command, enter the password as prompted and wait until the installation is complete.
$sudo apt-get install gcc
Recompile prodigal
$make
$ls
When the compilation is complete, the executable is obtained, but the system cannot yet find prodigal, so users need to add its directory to the PATH environment variable. Open the environment variable configuration file ".bashrc" via vim.
$vim ~/.bashrc
Add the configuration statement "export PATH=$PATH:$HOME/tools/Prodigal" at the end of the document. "$HOME" represents the user home directory, and "$HOME/tools/Prodigal" represents the directory where the prodigal executable is located.
Save and exit after editing is complete. Then run the "source ~/.bashrc" command so that the changes in ".bashrc" take effect in the current terminal.
$source ~/.bashrc
Check if the configuration is successful.
$prodigal -h
To add other software to the environment variable, append its path after the previous one. The paths are separated by ":", and there must be no spaces.
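The effect of appending several directories to PATH can be tried out with throwaway directories. The sketch below is illustrative only; "toolA", "toolB", "demo_tool_a", and "demo_tool_b" are hypothetical names, and the change to PATH lasts only for the current shell session because ".bashrc" is not edited.

```shell
# Create two temporary "software" directories, each holding one executable script.
workdir=$(mktemp -d)
mkdir -p "$workdir/toolA" "$workdir/toolB"
printf '#!/bin/sh\necho "toolA works"\n' > "$workdir/toolA/demo_tool_a"
printf '#!/bin/sh\necho "toolB works"\n' > "$workdir/toolB/demo_tool_b"
chmod +x "$workdir/toolA/demo_tool_a" "$workdir/toolB/demo_tool_b"

# Append both directories to PATH, separated by ":" with no spaces.
export PATH="$PATH:$workdir/toolA:$workdir/toolB"

# The shell can now find both programs by name.
command -v demo_tool_a
demo_tool_a
demo_tool_b

# Print PATH one directory per line to inspect the search order.
echo "$PATH" | tr ':' '\n'
```

Directories are searched from left to right, so a program that appears in an earlier directory takes precedence over one with the same name in a later directory.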
A soft link (symbolic link) is similar to a shortcut in Windows. Users can place a soft link to an executable program in a directory that is already on the system's default PATH, such as "/usr/bin/" or "/usr/local/bin". Still taking the newly compiled prodigal software as an example, the command for creating a soft link is "sudo ln -s /home/bio/tools/Prodigal/prodigal /usr/local/bin/prodigal". Enter the password as prompted to complete the creation.
$sudo ln -s /home/bio/tools/Prodigal/prodigal /usr/local/bin/prodigal
Check if the soft link is created successfully by using the "whereis" command.
$whereis prodigal
Note: Use the absolute path when creating a soft link; otherwise, the error message "Too many levels of symbolic links" may appear.
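Soft links can be practiced without "sudo" by working in a temporary directory. The following sketch uses made-up names ("demo_prog", "tools", "bin") in place of the real prodigal paths.

```shell
# Set up a fake "tools" directory with one executable and an empty "bin" directory.
workdir=$(mktemp -d)
mkdir -p "$workdir/tools" "$workdir/bin"
printf '#!/bin/sh\necho "demo program"\n' > "$workdir/tools/demo_prog"
chmod +x "$workdir/tools/demo_prog"

# Create the soft link, giving the absolute path of the target.
ln -s "$workdir/tools/demo_prog" "$workdir/bin/demo_prog"

# "ls -l" shows the link and its target; readlink prints the target only.
ls -l "$workdir/bin/demo_prog"
link_target=$(readlink "$workdir/bin/demo_prog")
echo "$link_target"

# Running the link runs the original program.
"$workdir/bin/demo_prog"
```

Because the link stores the target path as text, an absolute target keeps working no matter which directory the link is placed in.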
Conda, which is distributed with Anaconda and Miniconda, is a free, platform-agnostic, easy-to-install package manager and environment manager. Bioconda is a channel for the conda package manager that specializes in bioinformatics software. When software is installed via conda, all of its dependencies are installed with one command, which saves time and makes installation easier. At the time of writing, Bioconda has more than 600 contributors and 500 members, and it includes most bioinformatics software. Users can search the official website for the software they need.
Installation of conda
Miniconda is installed here. Go to the official website and select the installation file suitable for your system and Python version.
$wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
Start the installation
$bash Miniconda3-latest-Linux-x86_64.sh
Press Enter when prompted to view the license, enter "yes" and press "Enter" to continue. Press "Enter" to confirm the installation location. Miniconda is installed in the miniconda3 directory under the home directory. Type "yes" and press "Enter" to initialize. Finally, run the "source ~/.bashrc" command.
Set up bioconda channel
Add the channels by entering the following three commands in the terminal:
$conda config --add channels defaults
$conda config --add channels bioconda
$conda config --add channels conda-forge
Bioconda is now configured, and bioinformatics software can be installed via conda. Next, the mapping software "bwa" is installed via conda.
$conda install bwa
Enter "y" as prompted to complete the installation.
The source code installation method on macOS is the same as on Linux.
The configuration method is the same as on Linux.
The macOS environment variable configuration method is the same as that of Linux, but the configuration file is ".bash_profile" in the home directory (when the login shell is zsh, the default in recent macOS versions, the corresponding file is ".zshrc"). Run the following command to edit ".bash_profile".
$vim ~/.bash_profile
After the editing is completed and saved, users need to run the source command.
$source ~/.bash_profile
The package manager of macOS is Homebrew, and it can be installed with the following command.
$/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Install mapping software "bwa" via Homebrew.
$brew install bwa
Installation of Miniconda3
$wget https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
$sh Miniconda3-latest-MacOSX-x86_64.sh
$source ~/.bash_profile
Add channels of Bioconda
$conda config --add channels defaults
$conda config --add channels bioconda
$conda config --add channels conda-forge
Installation of the software bwa
$conda install bwa
The core idea of next-generation sequencing (NGS) technology is "sequencing by synthesis", which determines the DNA sequence by detecting each newly incorporated dNTP. The most widely used platforms are Illumina's products. The advantages of NGS include high throughput, scalability, low error rate, low cost, and speed.
Terms in Illumina sequencing:
Read length: The number of base pairs (bp) sequenced from a DNA fragment.
Insert size: The length of sequence between paired-end adapters.
Junction: The unsequenced part in the middle of the insert.
Flowcell: The location where the sequencing reaction occurs. One flowcell contains 8 lanes.
Lane: Each flowcell has eight lanes. Samples can be loaded onto separate lanes for simultaneous analysis on the Illumina sequencing system.
Raw data: Reads that still contain adapters, linkers, low-quality sequences, and short sequences.
Clean data: Raw data from which adapters, linkers, low-quality reads, short reads, and reads derived from ribosomal RNA and ncRNA have been removed.
Data size: (Read length) × (Number of reads).
Third-generation sequencing, also known as single-molecule sequencing, does not require a PCR amplification step and can produce ultra-long reads that span regions with high GC content and highly repetitive regions. Currently used sequencing platforms include Pacific Biosciences (PacBio) and Oxford Nanopore.
PacBio uses the SMRT Cell as the carrier for sequencing reactions. PacBio also uses the "sequencing by synthesis" method: sequencing reactions are carried out in nanoscale wells (zero-mode waveguides), in each of which a DNA polymerase and a DNA template are immobilized. The fluorescence signal of each dNTP is detected during the extension reaction to determine the base sequence.
The nanopore sequencing technology developed by Oxford Nanopore is real-time sequencing that determines the bases from electrical signals.
Platform/Instrument | Throughput (Gb) | Read length (bp) | Strength | Weakness |
---|---|---|---|---|
Sanger sequencing | ||||
ABI 3500/3730 | 0.0003 | Up to 1 kb | Read accuracy and length | Cost and throughput |
Illumina | ||||
MiniSeq | 1.7–7.5 | 1×75 to ×150 | Low initial investment | Run and read length |
MiSeq | 0.3–15 | 1×36 to 2×300 | Read length, scalability | Run length |
NextSeq | 10–120 | 1×75 to 2×150 | Throughput | Run and read length |
HiSeq (2500) | 10–1000 | ×50 to ×250 | Read accuracy, throughput | High initial investment, run |
NovaSeq 5000/6000 | 2000–6000 | 2×50 to ×150 | Read accuracy, throughput | High initial investment, run |
IonTorrent | ||||
PGM | 0.08–2 | Up to 400 | Read length, speed | Throughput, homopolymers |
S5 | 0.6–15 | Up to 400 | Read length, speed | Homopolymers |
Proton | 10–15 | Up to 200 | Speed, throughput | Homopolymers |
Pacific Biosciences | ||||
PacBio RSII | 0.5–1 | Up to 60 kb | Read length, speed (Average 10 kb, N50 20 kb) | High error rate and initial |
Sequel | 5–10 | Up to 60 kb | Read length, speed (Average 10 kb, N50 20 kb) | High error rate |
Oxford Nanopore | ||||
MinION | 0.1–1 | Up to 100 kb | Read length, portability | High error rate, run length |
FASTQ
A FASTQ file is a text file that contains the sequence data from the clusters that pass filter on a flow cell. Read data are usually stored in FASTQ format, and each read occupies 4 lines: the first line describes the sequencing run, the second line contains the nucleotide sequence, the third line begins with "+" and generally carries no further information, and the fourth line gives the sequencing quality of each base in the second line.
Strings | Description |
---|---|
@ST-E00310 | The unique instrument name |
147 | The run id |
HVT25CCXX | The flowcell id |
3 | Flowcell lane |
1011 | The tile number within the flowcell lane |
13382 | ‘x’-coordinate of the cluster within the tile |
1819 | ‘y’-coordinate of the cluster within the tile |
1 | The member of a pair, 1 or 2 (paired-end or mate-pair reads only) |
N | Y if the read fails filter (read is bad), N otherwise |
0 | 0 when none of the control bits are on, otherwise it is an even number |
TGAAGACA | Index sequence |
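The four-line record structure and the header fields in the table can be explored with standard shell tools. The sketch below is illustrative: the header is assembled from the strings in the table, assuming the usual Illumina layout in which the fields are joined by ":" and the two groups are separated by a space, while the bases and quality strings are made up.

```shell
# Write a tiny two-read FASTQ file (sequences and qualities are invented).
fastq=$(mktemp)
cat > "$fastq" <<'EOF'
@ST-E00310:147:HVT25CCXX:3:1011:13382:1819 1:N:0:TGAAGACA
ACGTACGTAC
+
IIIIIIIIII
@ST-E00310:147:HVT25CCXX:3:1011:13402:1819 1:N:0:TGAAGACA
TTGCA
+
IIII#
EOF

# Each read occupies 4 lines, so the read count is the line count divided by 4.
n_reads=$(awk 'END { print NR / 4 }' "$fastq")

# Line 2 of every 4-line record is the nucleotide sequence; sum the lengths.
n_bases=$(awk 'NR % 4 == 2 { total += length($0) } END { print total }' "$fastq")
echo "reads: $n_reads  bases: $n_bases"

# Split the first header into its two space-separated groups ...
header=$(head -n 1 "$fastq")
left=${header%% *}
right=${header#* }

# ... and split each group on ":" (the leading "@" is dropped first).
IFS=':' read -r instrument run_id flowcell lane tile x_pos y_pos <<EOF
${left#@}
EOF
IFS=':' read -r pair_member filtered control_bits index_seq <<EOF
$right
EOF

echo "instrument=$instrument run=$run_id flowcell=$flowcell lane=$lane tile=$tile x=$x_pos y=$y_pos"
echo "pair=$pair_member filtered=$filtered control=$control_bits index=$index_seq"
```

Reads are counted by dividing the line count by 4 rather than by counting lines that start with "@", because "@" is also a valid quality character and can appear at the start of a quality line.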
FASTA
The FASTA format is a text-based format for representing nucleotide sequences or amino acid sequences. Each sequence consists of two parts: the first part is a line that starts with ">" and contains the sequence ID, and the second part contains the nucleotide or amino acid sequence.
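As a quick illustration of the two-part structure, the sketch below counts the sequences in a small FASTA file and reports the length of each one. The sequences and IDs ("seq1", "seq2") are made up.

```shell
# Write a tiny FASTA file; the first sequence is wrapped over two lines.
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>seq1 example nucleotide sequence
ACGTACGT
ACGT
>seq2 second example
GGCC
EOF

# Count sequences: every record starts with a ">" line.
n_seqs=$(grep -c '^>' "$fasta")

# Print each sequence ID with its length (tab-separated).
seq_lengths=$(awk '
    /^>/ { if (id != "") print id "\t" len; id = substr($1, 2); len = 0; next }
         { len += length($0) }
    END  { if (id != "") print id "\t" len }
' "$fasta")

echo "sequences: $n_seqs"
echo "$seq_lengths"
```

The length is accumulated over all lines up to the next ">" because a single sequence may be wrapped across several lines.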
GenBank
The GenBank format stores a sequence together with its annotation.
GFF3
GFF3 (Generic Feature Format version 3) describes features of biological sequences. Each line consists of 9 tab-delimited columns.
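The nine columns of a GFF3 line are, in order: seqid, source, type, start, end, score, strand, phase, and attributes. A minimal sketch of reading them with awk is shown below; the three feature lines are made up for illustration.

```shell
# Write a small GFF3 file with tab-separated columns (made-up features).
gff=$(mktemp)
printf '##gff-version 3\n' > "$gff"
printf 'contig1\texample\tCDS\t100\t400\t.\t+\t0\tID=gene1\n' >> "$gff"
printf 'contig1\texample\tCDS\t900\t1500\t.\t-\t0\tID=gene2\n' >> "$gff"
printf 'contig2\texample\ttRNA\t50\t120\t.\t+\t.\tID=trna1\n' >> "$gff"

# Skip comment lines, keep CDS features, and print
# seqid, start, end, strand, and feature length (end - start + 1).
cds_table=$(awk -F '\t' '
    /^#/ { next }
    NF == 9 && $3 == "CDS" { print $1, $4, $5, $7, $5 - $4 + 1 }
' "$gff")

echo "$cds_table"
```

GFF3 coordinates are 1-based and include both the start and the end position, which is why 1 is added when computing the feature length.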
Sequencing depth: Sequencing depth refers to the ratio of the total number of bases obtained by sequencing (read length × number of reads) to the haploid genome length.
Coverage: Coverage refers to the proportion of sequences obtained by sequencing to the whole genome. Due to the presence of complex regions such as high GC and repetitive sequences in the genome, sequences obtained by sequencing often fail to cover all regions of the genome. For example, the coverage is 96%, indicating that 4% of the genome regions were not sequenced.
Read, Contig, and Scaffold: The sequences obtained by sequencing are called reads. A sequence assembled from reads according to their overlaps is known as a contig. Contigs are arranged in order according to paired-end information to obtain a scaffold.
N50: N50 is a measure of the quality of assembled genomes that are fragmented into contigs of different lengths. When the contigs are sorted from longest to shortest, the N50 is the length of the contig at which the cumulative length first reaches 50% of the total assembly length.
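Two of the terms above, sequencing depth and N50, are simple calculations and can be sketched in the shell. The function names "sequencing_depth" and "n50" are hypothetical helpers, and the input numbers are made up. N50 is computed here against the total length of the contigs supplied, which is the usual convention when the true genome size is unknown.

```shell
# Depth = (read length x number of reads) / haploid genome length.
# Arguments: read length (bp), number of reads, genome length (bp).
sequencing_depth() {
    awk -v len="$1" -v n="$2" -v g="$3" 'BEGIN { printf "%.1f\n", (len * n) / g }'
}

# N50: sort contig lengths from longest to shortest and report the length
# at which the running total first reaches half of the summed length.
# Arguments: contig lengths (bp).
n50() {
    printf '%s\n' "$@" | sort -rn | awk '
        { len[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++) {
                running += len[i]
                if (running * 2 >= total) { print len[i]; exit }
            }
        }'
}

# 2,000,000 reads of 150 bp on a 5 Mb genome.
sequencing_depth 150 2000000 5000000    # prints 60.0

# Six contigs with a total length of 290; 80 + 70 = 150 is the first
# running total that reaches 145, so the N50 is 70.
n50 80 70 50 40 30 20                   # prints 70
```

For paired-end data, the number of reads passed to the depth calculation should count both mates of each pair.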
Besser J, Carleton HA, Gerner-Smidt P, Lindsey RL, Trees E. Next-generation sequencing technologies and their application to the study and control of bacterial infections. Clinical Microbiology and Infection, 2018, 24: 335-341