Getting Started with SBSM
This tutorial provides a complete step-by-step guide to using the SBSM algorithm for sequence classification tasks. By the end of this guide, you will understand how to prepare data, train a model using SBSM, and evaluate the results.
Download the Example Project
Before starting, please download the example project, which contains the necessary input files including training and test sequences, as well as their corresponding labels. These files are required for running the code in the following steps.
The following files should be placed in your working directory:
File Name | Description |
---|---|
example.py | An example script showing how to use the SBSM algorithm. See Example 1. |
example_cross_validation.py | An example script showing how to use the SBSM algorithm with cross-validation. See Example 2. |
train_samples.fasta | Protein sequences used for training, stored in FASTA format. |
train_labels.txt | Binary classification labels for the training sequences. |
test_samples.fasta | Protein sequences used for testing, stored in FASTA format. |
test_labels.txt | Binary classification labels for the test sequences. |
Example 1 (Training and Testing)
Step 1: Load the Data
This step reads the protein sequences and their labels from files. The sequences are stored in FASTA format, and the labels are stored in plain text with one label per line. The functions below load both into Python lists for further processing.
from sbsm import SBSM
def read_fasta(file_path):
    # Assumes each record is one header line followed by one sequence line,
    # so the sequences sit on every second line.
    with open(file_path, 'r') as f:
        lines = f.readlines()
    return [line.strip() for line in lines[1::2]]
def read_label(file_path):
    # One integer label per line.
    with open(file_path, 'r') as file:
        return [int(line.strip()) for line in file.readlines()]
train_sequences = read_fasta("train_samples.fasta")
test_sequences = read_fasta("test_samples.fasta")
train_labels = read_label("train_labels.txt")
test_labels = read_label("test_labels.txt")
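The `read_fasta` helper above takes every second line, so it only works when each record is exactly one header line followed by one sequence line. If a FASTA file wraps sequences across several lines, a reader along the following lines may be safer. This is a minimal sketch: `read_fasta_multiline` is a hypothetical name, not part of the `sbsm` package, and it accepts any iterable of lines so it can be tried without a file.

```python
def read_fasta_multiline(lines):
    """Collect sequences from FASTA-formatted lines, joining wrapped sequence lines.

    Hypothetical helper for illustration; not part of the sbsm package.
    """
    sequences = []
    current = None  # parts of the record being read; None before the first header
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A header starts a new record; store the previous one first.
            if current is not None:
                sequences.append("".join(current))
            current = []
        elif current is not None:
            current.append(line)
    if current is not None:
        sequences.append("".join(current))
    return sequences


example = [">seq1", "MKTAY", "IAKQR", ">seq2", "GAVLI"]
print(read_fasta_multiline(example))  # ['MKTAYIAKQR', 'GAVLI']
```

To use it on a file, pass the open file handle, for example `with open("train_samples.fasta") as f: train_sequences = read_fasta_multiline(f)`.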
Step 2: Initialize the SBSM Model
Here, we define an SBSM model instance and set its hyperparameters, including scoring rules for matches, mismatches, and gaps, as well as the number of neighbors and regularization terms.
sbsm = SBSM(
    match_score=2,
    mismatch_score=-1,
    gap_score=-2,
    k_neighbors=15,
    lambda_=0.9,
    nu1=0.001,
    nu2=0.001,
    process_num=30
)
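To give a feel for how match, mismatch, and gap scores combine, the sketch below scores two sequences that are already aligned, using the same values as above. It is a generic illustration of alignment scoring, not the SBSM implementation: how SBSM aligns sequences internally (for example, local versus global alignment, or how consecutive gaps are charged) is not described here, and `score_aligned_pair` is a hypothetical helper.

```python
def score_aligned_pair(a, b, match_score=2, mismatch_score=-1, gap_score=-2):
    """Sum per-column scores for two already-aligned, equal-length strings.

    Illustration only: '-' marks a gap, and every gap column costs gap_score.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            total += gap_score
        elif x == y:
            total += match_score
        else:
            total += mismatch_score
    return total


print(score_aligned_pair("AC-GT", "ACTGT"))  # 4 matches, 1 gap: 4*2 - 2 = 6
print(score_aligned_pair("ACGT", "AGGT"))    # 3 matches, 1 mismatch: 3*2 - 1 = 5
```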
Step 3: Compute Kernel Matrices
The SBSM algorithm uses kernel matrices to represent the similarity between sequence samples. In this step, we generate both the training and testing kernel matrices.
train_kernel, test_kernel = sbsm.compute_kernels(
    X_train=train_sequences,
    X_test=test_sequences,
    y_train=train_labels
)
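To make the idea of a kernel matrix concrete, the sketch below builds one from a toy similarity: the number of distinct 2-mers that two sequences share. The toy similarity is a stand-in and is not the similarity SBSM computes. The shapes shown (train by train for training, test by train for testing) follow the common precomputed-kernel convention; whether `compute_kernels` returns its matrices in exactly that layout is an assumption.

```python
def kmers(seq, k=2):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def toy_similarity(a, b):
    """Toy stand-in similarity: number of shared distinct 2-mers."""
    return len(kmers(a) & kmers(b))


def kernel_matrix(rows, cols):
    """Entry [i][j] is the similarity between rows[i] and cols[j]."""
    return [[toy_similarity(r, c) for c in cols] for r in rows]


train = ["MKTAY", "MKTGG", "GAVLI"]
test = ["KTAYG", "AVLIM"]

train_kernel = kernel_matrix(train, train)  # 3 x 3, symmetric
test_kernel = kernel_matrix(test, train)    # 2 x 3, one row per test sequence

print(train_kernel)  # [[4, 2, 0], [2, 4, 0], [0, 0, 4]]
print(test_kernel)   # [[3, 1, 0], [0, 0, 3]]
```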
Step 4: Train the Model
Once the training kernel matrix is ready, we can train the model using the training data. The `fit` function adjusts the model based on the similarity scores and the corresponding labels.
sbsm.fit(
    train_kernel,
    train_labels,
    alpha=1.0,
    C=1.0
)
Step 5: Make Predictions
With the trained model, we now use the test kernel matrix to make predictions. The model can return either predicted labels or the associated probabilities.
predictions = sbsm.predict(test_kernel)
probabilities = sbsm.predict_proba(test_kernel)
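If a decision threshold other than the default is wanted, probabilities can be converted to labels by hand. The sketch below assumes that the probabilities are available as a flat sequence of positive-class values; the actual return format of `predict_proba` is not documented here, so that is an assumption, and `labels_from_probabilities` is a hypothetical helper.

```python
def labels_from_probabilities(probs, threshold=0.5):
    """Map positive-class probabilities to 0/1 labels (assumed flat input)."""
    return [1 if p >= threshold else 0 for p in probs]


toy_probs = [0.91, 0.12, 0.50, 0.47]  # made-up values for illustration
print(labels_from_probabilities(toy_probs))       # [1, 0, 1, 0]
print(labels_from_probabilities(toy_probs, 0.4))  # [1, 0, 1, 1]
```

Lowering the threshold trades a higher true positive rate for a lower true negative rate, so it is worth re-checking both metrics after changing it.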
Step 6: Evaluate the Results
To assess the model's performance, we compute several metrics including True Positive Rate (TPR), True Negative Rate (TNR), Accuracy (ACC), and Matthews Correlation Coefficient (MCC). These metrics are helpful for evaluating binary classification results.
def evaluate_metrics(true_labels, predictions):
    # Count the four confusion-matrix cells for 0/1 labels.
    TP = FP = TN = FN = 0
    for pred, true in zip(predictions, true_labels):
        if pred == 1 and true == 1:
            TP += 1
        elif pred == 1 and true == 0:
            FP += 1
        elif pred == 0 and true == 0:
            TN += 1
        elif pred == 0 and true == 1:
            FN += 1
    # Each metric falls back to 0 when its denominator is zero.
    TPR = TP / (TP + FN) if TP + FN != 0 else 0
    TNR = TN / (TN + FP) if TN + FP != 0 else 0
    ACC = (TP + TN) / (TP + TN + FP + FN) if (TP + TN + FP + FN) != 0 else 0
    MCC_numerator = TP * TN - FP * FN
    MCC_denominator = ((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))**0.5
    MCC = MCC_numerator / MCC_denominator if MCC_denominator != 0 else 0
    return TPR, TNR, ACC, MCC
TPR, TNR, ACC, MCC = evaluate_metrics(test_labels, predictions)
print(f"TPR:{TPR}, TNR:{TNR}, ACC:{ACC}, MCC:{MCC}")
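As a check on the formulas, the four metrics can be worked through by hand on a small made-up example. The sketch below uses eight invented labels and predictions (not SBSM output) and a compact, hypothetical `confusion_counts` helper that counts the same four cells as the function above.

```python
from math import sqrt


def confusion_counts(true_labels, predictions):
    """Return (TP, TN, FP, FN) for 0/1 labels."""
    pairs = list(zip(true_labels, predictions))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


# Made-up labels for illustration only.
true_labels = [1, 1, 1, 0, 0, 0, 0, 1]
predictions = [1, 1, 0, 0, 0, 1, 0, 1]

tp, tn, fp, fn = confusion_counts(true_labels, predictions)
tpr = tp / (tp + fn)                      # 3 / 4
tnr = tn / (tn + fp)                      # 3 / 4
acc = (tp + tn) / len(true_labels)        # 6 / 8
mcc = (tp * tn - fp * fn) / sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)                                         # (9 - 1) / sqrt(256)

print(tp, tn, fp, fn)      # 3 3 1 1
print(tpr, tnr, acc, mcc)  # 0.75 0.75 0.75 0.5
```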
Example 2 (Cross-validation)
Step 1: Load the Dataset
In this step, we load the training set, which serves as the full dataset for cross-validation. The sequences are in FASTA format, and the labels are stored in a plain text file.
def read_fasta(file_path):
    with open(file_path, 'r') as f:
        lines = f.readlines()
    return [line.strip() for line in lines[1::2]]
def read_label(file_path):
    with open(file_path, 'r') as file:
        return [int(line.strip()) for line in file.readlines()]
sequences = read_fasta("train_samples.fasta")
labels = read_label("train_labels.txt")
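Sequences and labels are paired by position, and the metrics used later only recognise the labels 0 and 1, so a quick consistency check before computing kernels can catch a mismatched or mislabelled file early. The helper below is a hypothetical sketch and is not part of the `sbsm` package.

```python
def check_dataset(sequences, labels):
    """Validate that sequences and labels line up and labels are binary.

    Returns the number of samples per class.
    """
    if len(sequences) != len(labels):
        raise ValueError(
            f"{len(sequences)} sequences but {len(labels)} labels"
        )
    unexpected = set(labels) - {0, 1}
    if unexpected:
        raise ValueError(f"non-binary labels found: {sorted(unexpected)}")
    return {label: labels.count(label) for label in (0, 1)}


print(check_dataset(["MKT", "GAV", "LIM"], [1, 0, 1]))  # {0: 1, 1: 2}
```

The returned class counts also show at a glance whether the dataset is heavily imbalanced, which matters when reading accuracy next to MCC.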
Step 2: Initialize the SBSM Model
Here we initialize the SBSM model with predefined hyperparameters. These parameters control sequence alignment scoring and the regularization strength.
from sbsm import SBSM
sbsm = SBSM(
    match_score=2,
    mismatch_score=-1,
    gap_score=-2,
    k_neighbors=15,
    lambda_=0.9,
    nu1=0.001,
    nu2=0.001,
    process_num=30
)
Step 3: Compute Cross-Validation Kernel Matrices
This step uses the SBSM method `compute_kernels_cv` to generate kernel matrices for each fold of the cross-validation. It divides the dataset into `k` parts and calculates the similarity matrices needed for training and testing in each fold.
cv_kernel = sbsm.compute_kernels_cv(
    X_train=sequences,
    y_train=labels,
    k_fold=5,
    shuffle=True,
    random_state=0,
)
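For readers unfamiliar with k-fold splitting, the sketch below shows one generic way to divide sample indices into `k_fold` parts, with an optional seeded shuffle so the split is reproducible. It illustrates the general idea only: how `compute_kernels_cv` actually assigns samples to folds (for example, whether it stratifies by label) is not stated here, and `k_fold_indices` is a hypothetical helper.

```python
import random


def k_fold_indices(n_samples, k_fold=5, shuffle=True, random_state=0):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    if shuffle:
        # A seeded generator makes the shuffle repeatable across runs.
        random.Random(random_state).shuffle(indices)
    # Deal the indices round-robin into k_fold groups of near-equal size.
    folds = [indices[i::k_fold] for i in range(k_fold)]
    splits = []
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train_idx, test_idx))
    return splits


for train_idx, test_idx in k_fold_indices(10, k_fold=5):
    print(len(train_idx), len(test_idx))  # 8 2 for every fold
```

Each sample appears in exactly one test fold, so every sequence is used for testing once and for training in the remaining folds.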
Step 4: Train and Evaluate Each Fold
We now loop through each fold, train the model, make predictions, and compute evaluation metrics. Results for each fold are printed individually.
import numpy as np
def evaluate_metrics(true_labels, predictions):
    TP = FP = TN = FN = 0
    for pred, true in zip(predictions, true_labels):
        if pred == 1 and true == 1:
            TP += 1
        elif pred == 1 and true == 0:
            FP += 1
        elif pred == 0 and true == 0:
            TN += 1
        elif pred == 0 and true == 1:
            FN += 1
    TPR = TP / (TP + FN) if TP + FN != 0 else 0
    TNR = TN / (TN + FP) if TN + FP != 0 else 0
    ACC = (TP + TN) / (TP + TN + FP + FN) if (TP + TN + FP + FN) != 0 else 0
    MCC_numerator = TP * TN - FP * FN
    MCC_denominator = ((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))**0.5
    MCC = MCC_numerator / MCC_denominator if MCC_denominator != 0 else 0
    return TPR, TNR, ACC, MCC
tpr_list, tnr_list, acc_list, mcc_list = [], [], [], []
for i, (K_train, K_test, y_train, y_test) in enumerate(cv_kernel):
    sbsm.fit(K_train, y_train, alpha=0.5, C=64.0)
    predictions = sbsm.predict(K_test)
    TPR, TNR, ACC, MCC = evaluate_metrics(y_test, predictions)
    print(f"Fold{i}: TPR:{TPR:.4f}, TNR:{TNR:.4f}, ACC:{ACC:.4f}, MCC:{MCC:.4f}")
    tpr_list.append(TPR)
    tnr_list.append(TNR)
    acc_list.append(ACC)
    mcc_list.append(MCC)
Step 5: Evaluate the Results
After evaluating all folds, we compute and print the mean scores for each metric across all cross-validation iterations. These values provide an overall estimate of model performance.
print(f" Mean: TPR:{np.mean(tpr_list):.4f}, TNR:{np.mean(tnr_list):.4f}, ACC:{np.mean(acc_list):.4f}, MCC:{np.mean(mcc_list):.4f}")
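A mean alone hides how much the folds disagree, so it can help to report the spread as well. The sketch below uses the standard library to compute the mean and sample standard deviation of a list of per-fold scores; the numbers are made up for illustration and are not SBSM results, and `summarize` is a hypothetical helper.

```python
from statistics import mean, stdev


def summarize(scores):
    """Return (mean, sample standard deviation) of per-fold scores."""
    return mean(scores), stdev(scores)


toy_acc = [0.80, 0.85, 0.75, 0.90, 0.80]  # made-up values, not SBSM results
avg, spread = summarize(toy_acc)
print(f"ACC: {avg:.4f} +/- {spread:.4f}")  # ACC: 0.8200 +/- 0.0570
```

The same call works on each of the per-fold lists collected in Step 4, for example `summarize(mcc_list)`.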