Getting Started with SBSM

This tutorial provides a complete step-by-step guide to using the SBSM algorithm for sequence classification tasks. By the end of this guide, you will understand how to prepare data, train a model using SBSM, and evaluate the results.

Download the Example Project

Before starting, please download the example project, which contains the necessary input files including training and test sequences, as well as their corresponding labels. These files are required for running the code in the following steps.

📦 Download Example Project

The following files should be placed in your working directory:

File Name                    Description
example.py                   An example script showing how to use the SBSM algorithm. See Example 1.
example_cross_validation.py  An example script showing how to use the SBSM algorithm with cross-validation. See Example 2.
train_samples.fasta          Protein sequences used for training, stored in FASTA format.
train_labels.txt             Binary classification labels for the training sequences.
test_samples.fasta           Protein sequences used for testing, stored in FASTA format.
test_labels.txt              Binary classification labels for the test sequences.

Example 1 (Training and Testing)

This example trains an SBSM model on the training set and evaluates it on the separate test set. It corresponds to the script example.py.

Step 1: Load the Data

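This step reads the protein sequences and their labels from files. The sequences are in FASTA format, while the labels are in plain text, one integer per line. Note that read_fasta keeps every second line, so it assumes each record occupies exactly two lines: a header line followed by the full sequence on a single line.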

from sbsm import SBSM

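def read_fasta(file_path):
    # Assumes two lines per record: a '>' header line, then the sequence on one line.
    with open(file_path, 'r') as f:
        lines = f.readlines()
        # Keep every second line (the sequences) and drop the headers.
        return [line.strip() for line in lines[1::2]]

def read_label(file_path):
    # One integer label per line; blank lines are skipped.
    with open(file_path, 'r') as file:
        return [int(line.strip()) for line in file if line.strip()]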

train_sequences = read_fasta("train_samples.fasta")
test_sequences = read_fasta("test_samples.fasta")
train_labels = read_label("train_labels.txt")
test_labels = read_label("test_labels.txt")
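If your own FASTA files wrap a sequence across several lines, the read_fasta helper above would return fragments, because it keeps every second line. A more tolerant reader could look like the sketch below. The name read_fasta_multiline is hypothetical and not part of the sbsm package.

```python
def read_fasta_multiline(file_path):
    """Return the sequences in a FASTA file, joining wrapped sequence lines."""
    sequences = []
    current = []  # pieces of the record currently being read
    with open(file_path, "r") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A header starts a new record; store the previous one, if any.
                if current:
                    sequences.append("".join(current))
                    current = []
            else:
                current.append(line)
    # Store the last record, which has no header after it.
    if current:
        sequences.append("".join(current))
    return sequences
```

For two-line records like those the original helper expects, this returns the same list, so it can be swapped in without changing the later steps.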

Step 2: Initialize the SBSM Model

Here, we define an SBSM model instance and set its hyperparameters, including scoring rules for matches, mismatches, and gaps, as well as the number of neighbors and regularization terms.

sbsm = SBSM(
    match_score=2,
    mismatch_score=-1,
    gap_score=-2,
    k_neighbors=15,
    lambda_=0.9,
    nu1=0.001,
    nu2=0.001,
    process_num=30
)
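To give a feel for what match_score, mismatch_score, and gap_score mean, the sketch below computes a standard global alignment score (Needleman-Wunsch with a linear gap penalty) using the same three values. It illustrates the scoring scheme only: this tutorial does not describe how SBSM aligns sequences internally, and global_alignment_score is a hypothetical helper, not part of the sbsm package.

```python
def global_alignment_score(a, b, match_score=2, mismatch_score=-1, gap_score=-2):
    """Best global alignment score of strings a and b (linear gap penalty)."""
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap_score for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap_score]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            pair = match_score if a[i - 1] == b[j - 1] else mismatch_score
            curr.append(max(
                prev[j - 1] + pair,       # align a[i-1] with b[j-1]
                prev[j] + gap_score,      # a[i-1] aligned with a gap
                curr[j - 1] + gap_score,  # b[j-1] aligned with a gap
            ))
        prev = curr
    return prev[-1]

print(global_alignment_score("ACDE", "ACDE"))  # 8: four matches
print(global_alignment_score("ACDE", "ACE"))   # 4: three matches and one gap
```

With these values a gap costs more than a mismatch, so the alignment only opens a gap when doing so recovers at least one extra match.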

Step 3: Compute Kernel Matrices

The SBSM algorithm uses kernel matrices to represent the similarity between sequence samples. In this step, we generate both the training and testing kernel matrices.

train_kernel, test_kernel = sbsm.compute_kernels(
    X_train=train_sequences,
    X_test=test_sequences,
    y_train=train_labels
)
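As a rough picture of what such matrices contain, the sketch below builds pairwise similarity matrices from a toy similarity function. The shapes follow the common convention for precomputed kernels (training matrix n_train x n_train, test matrix n_test x n_train). Whether compute_kernels returns exactly this layout is an assumption, and toy_similarity is a hypothetical stand-in for SBSM's actual similarity measure.

```python
def toy_similarity(a, b):
    """Hypothetical stand-in: share of positions holding the same character."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def pairwise_matrix(rows, cols, similarity):
    """Matrix whose entry [i][j] is similarity(rows[i], cols[j])."""
    return [[similarity(r, c) for c in cols] for r in rows]

train = ["ACDE", "ACDF", "GGGG"]
test = ["ACDE", "GGGA"]

K_train = pairwise_matrix(train, train, toy_similarity)  # 3 x 3, symmetric
K_test = pairwise_matrix(test, train, toy_similarity)    # 2 x 3: test rows, training columns
```

Each row of K_test describes one test sequence through its similarity to every training sequence, which is why the test matrix has to be computed against the training set.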

Step 4: Train the Model

Once the training kernel matrix is ready, we can train the model. The fit method takes the training kernel and the corresponding labels, together with the hyperparameters alpha and C, and adjusts the model based on the similarity scores.

sbsm.fit(
    train_kernel,
    train_labels,
    alpha=1.0,
    C=1.0
)

Step 5: Make Predictions

With the trained model, we now use the test kernel matrix to make predictions. The model can return either predicted labels or the associated probabilities.

predictions = sbsm.predict(test_kernel)
probabilities = sbsm.predict_proba(test_kernel)

Step 6: Evaluate the Results

To assess the model's performance, we compute several metrics including True Positive Rate (TPR), True Negative Rate (TNR), Accuracy (ACC), and Matthews Correlation Coefficient (MCC). These metrics are helpful for evaluating binary classification results.
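For reference, with TP, FP, TN, and FN denoting the numbers of true positives, false positives, true negatives, and false negatives, the four metrics are defined as follows. The code in this step additionally returns 0 whenever a denominator is zero.

```latex
\begin{aligned}
\mathrm{TPR} &= \frac{TP}{TP + FN}, &
\mathrm{TNR} &= \frac{TN}{TN + FP}, \\[4pt]
\mathrm{ACC} &= \frac{TP + TN}{TP + TN + FP + FN}, &
\mathrm{MCC} &= \frac{TP \cdot TN - FP \cdot FN}
                     {\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\end{aligned}
```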

def evaluate_metrics(true_labels, predictions):
    TP = FP = TN = FN = 0
    for pred, true in zip(predictions, true_labels):
        if pred == 1 and true == 1:
            TP += 1
        elif pred == 1 and true == 0:
            FP += 1
        elif pred == 0 and true == 0:
            TN += 1
        elif pred == 0 and true == 1:
            FN += 1

    TPR = TP / (TP + FN) if TP + FN != 0 else 0
    TNR = TN / (TN + FP) if TN + FP != 0 else 0
    ACC = (TP + TN) / (TP + TN + FP + FN) if (TP + TN + FP + FN) != 0 else 0
    MCC_numerator = TP * TN - FP * FN
    MCC_denominator = ((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))**0.5
    MCC = MCC_numerator / MCC_denominator if MCC_denominator != 0 else 0

    return TPR, TNR, ACC, MCC

TPR, TNR, ACC, MCC = evaluate_metrics(test_labels, predictions)
print(f"TPR:{TPR}, TNR:{TNR}, ACC:{ACC}, MCC:{MCC}")

Example 2 (Cross-validation)

Step 1: Load the Dataset

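In this step, we load the dataset used for cross-validation, here the training sequences and labels from the example project. The sequences are in FASTA format, and the labels are stored in a plain text file. This example corresponds to the script example_cross_validation.py.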

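def read_fasta(file_path):
    # Assumes two lines per record: a '>' header line, then the sequence on one line.
    with open(file_path, 'r') as f:
        lines = f.readlines()
        # Keep every second line (the sequences) and drop the headers.
        return [line.strip() for line in lines[1::2]]

def read_label(file_path):
    # One integer label per line; blank lines are skipped.
    with open(file_path, 'r') as file:
        return [int(line.strip()) for line in file if line.strip()]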

sequences = read_fasta("train_samples.fasta")
labels = read_label("train_labels.txt")

Step 2: Initialize the SBSM Model

Here we initialize the SBSM model with predefined hyperparameters. These parameters control sequence alignment scoring and the regularization strength.

from sbsm import SBSM

sbsm = SBSM(
    match_score=2,
    mismatch_score=-1,
    gap_score=-2,
    k_neighbors=15,
    lambda_=0.9,
    nu1=0.001,
    nu2=0.001,
    process_num=30
)

Step 3: Compute Cross-Validation Kernel Matrices

This step uses the SBSM method compute_kernels_cv to generate kernel matrices for each fold of the cross-validation. It divides the dataset into k parts (k_fold=5 here) and calculates the similarity matrices needed for training and testing in each fold. Setting shuffle=True with a fixed random_state makes the shuffled split reproducible.

cv_kernel = sbsm.compute_kernels_cv(
    X_train=sequences,
    y_train=labels,
    k_fold=5,
    shuffle=True,
    random_state=0,
)
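The fold construction can be pictured with the sketch below, which shuffles the sample indices with a fixed seed and deals them into k_fold nearly equal test sets. It only illustrates plain k-fold splitting: k_fold_indices is a hypothetical helper, and this tutorial does not state whether compute_kernels_cv splits this way or, for example, stratifies by label.

```python
import random

def k_fold_indices(n_samples, k_fold=5, shuffle=True, random_state=0):
    """Yield (train_idx, test_idx) index lists for plain k-fold cross-validation."""
    indices = list(range(n_samples))
    if shuffle:
        # A dedicated Random instance keeps the split reproducible for a given seed.
        random.Random(random_state).shuffle(indices)
    # Fold i takes every k-th index starting at position i,
    # so fold sizes differ by at most one.
    folds = [indices[i::k_fold] for i in range(k_fold)]
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

for train_idx, test_idx in k_fold_indices(12, k_fold=5):
    print(len(train_idx), len(test_idx))
```

Every sample lands in exactly one test fold, so each sequence is predicted once by a model that never saw it during training.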

Step 4: Train and Evaluate Each Fold

We now loop through each fold, train the model, make predictions, and compute evaluation metrics. Results for each fold are printed individually.

import numpy as np

def evaluate_metrics(true_labels, predictions):
    TP = FP = TN = FN = 0
    for pred, true in zip(predictions, true_labels):
        if pred == 1 and true == 1:
            TP += 1
        elif pred == 1 and true == 0:
            FP += 1
        elif pred == 0 and true == 0:
            TN += 1
        elif pred == 0 and true == 1:
            FN += 1

    TPR = TP / (TP + FN) if TP + FN != 0 else 0
    TNR = TN / (TN + FP) if TN + FP != 0 else 0
    ACC = (TP + TN) / (TP + TN + FP + FN) if (TP + TN + FP + FN) != 0 else 0
    MCC_numerator = TP * TN - FP * FN
    MCC_denominator = ((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))**0.5
    MCC = MCC_numerator / MCC_denominator if MCC_denominator != 0 else 0

    return TPR, TNR, ACC, MCC

tpr_list, tnr_list, acc_list, mcc_list = [], [], [], []

for i, (K_train, K_test, y_train, y_test) in enumerate(cv_kernel):
    sbsm.fit(K_train, y_train, alpha=0.5, C=64.0)
    predictions = sbsm.predict(K_test)
    TPR, TNR, ACC, MCC = evaluate_metrics(y_test, predictions)
    print(f"Fold{i}: TPR:{TPR:.4f}, TNR:{TNR:.4f}, ACC:{ACC:.4f}, MCC:{MCC:.4f}")
    tpr_list.append(TPR)
    tnr_list.append(TNR)
    acc_list.append(ACC)
    mcc_list.append(MCC)

Step 5: Evaluate the Results

After evaluating all folds, we compute and print the mean scores for each metric across all cross-validation iterations. These values provide an overall estimate of model performance.

print(f" Mean: TPR:{np.mean(tpr_list):.4f}, TNR:{np.mean(tnr_list):.4f}, ACC:{np.mean(acc_list):.4f}, MCC:{np.mean(mcc_list):.4f}")
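Fold-to-fold spread is often reported alongside the mean. A minimal sketch follows, using made-up placeholder scores rather than real SBSM results; summarize is a hypothetical helper, not part of the sbsm package.

```python
import statistics

def summarize(name, values):
    """Format the mean and sample standard deviation of per-fold scores."""
    mean = statistics.mean(values)
    # stdev needs at least two values; report 0.0 for a single fold.
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return f"{name}: {mean:.4f} ± {std:.4f}"

# Placeholder values for illustration only, not results from SBSM.
example_acc = [0.80, 0.90, 0.85, 0.75, 0.95]
print(summarize("ACC", example_acc))
```

In the cross-validation script, the same helper could be applied to tpr_list, tnr_list, acc_list, and mcc_list.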