# BiLSTM COVID-19 Tweet Classifier

## Overview
This project implements a BiLSTM-based binary classifier to identify whether short social media posts about COVID-19 are informative or uninformative. The model demonstrates end-to-end text processing: tokenization, vocabulary building, sequence padding, an embedding layer, bidirectional LSTM encoding, and a final classifier head.
## Dataset & Preprocessing

- Source: short tweet-style texts labeled as `INFORMATIVE` or `UNINFORMATIVE`.
- Tokenization: simple regex-based tokenizer that lowercases and extracts word tokens.
- Vocabulary: built from training texts with a minimum frequency threshold, including special tokens `<pad>` and `<unk>`.
- Sequences: converted to integer token ids and padded/truncated to a fixed `max_len`.
The implementation in the project notebook uses the WNUT-style tab-separated files mounted from Google Drive (see ECE_364_Final_project.ipynb). Key preprocessing details from the notebook:

- File loading: `pd.read_csv(..., sep='\t', names=["Id","Text","Label"])`, with an initial clean where the first CSV row is dropped for the training file.
- Label mapping: `{"UNINFORMATIVE": 0, "INFORMATIVE": 1}` is applied across the train/valid/test splits.
- Tokenizer: `re.findall(r"\w+", text.lower())`, which preserves alphanumeric tokens and removes punctuation.
- Vocabulary building: counts words over the training split and adds tokens with frequency >= `min_freq` (the notebook uses `min_freq=2`). The vocab starts with `{"<pad>": 0, "<unk>": 1}`.
- Padding: sequences are padded with `0` (`<pad>`) to a uniform `max_len` (the notebook uses `max_len = 45`). Longer sequences are truncated from the end.
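The tokenization, vocabulary, and padding steps above can be sketched as a few small helpers. The function names here (`tokenize`, `build_vocab`, `encode`) are illustrative and may differ from the notebook's:

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep alphanumeric word tokens, dropping punctuation.
    return re.findall(r"\w+", text.lower())

def build_vocab(texts, min_freq=2):
    # Count tokens over the training split; keep those at or above min_freq.
    counts = Counter(tok for t in texts for tok in tokenize(t))
    vocab = {"<pad>": 0, "<unk>": 1}
    for tok, freq in counts.items():
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab

def encode(text, vocab, max_len=45):
    # Map tokens to ids (unknown tokens -> <unk>), then pad/truncate to max_len.
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokenize(text)]
    ids = ids[:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))
```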
## Technical Walkthrough
The following walkthrough explains the exact data flow, tensor shapes, and algorithmic steps so a reader can understand the model without opening the notebook or code files.
- Tokenization example:

```
Input: "COVID updates: 10 new cases!"
Tokenizer (regex \w+): ['covid', 'updates', '10', 'new', 'cases']
```
- Vocabulary example (built from training data, indices shown):

```
{'<pad>': 0, '<unk>': 1, 'covid': 2, 'cases': 3, 'vaccine': 4, 'hospital': 5, ...}
```
- Text -> sequence -> padded sequence example:

```
Text tokens: ['covid', 'updates', '10', 'new', 'cases']
Sequence (ids): [2, 17, 42, 11, 3]
Padded to max_len=8: [2, 17, 42, 11, 3, 0, 0, 0]
```
- Tensor shapes (per batch):

```
input_ids shape:       [batch_size, seq_len]               # e.g., [32, 45]
after embedding:       [batch_size, seq_len, embed_dim]    # e.g., [32, 45, 75]
LSTM output:           [batch_size, seq_len, hidden_dim*2] # e.g., [32, 45, 192]
select last timestep:  [batch_size, hidden_dim*2]          # e.g., [32, 192]
final logits:          [batch_size, num_classes]           # e.g., [32, 2]
```
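The shape progression above can be verified with a few standalone layers; the vocabulary size of 5000 is assumed here for illustration:

```python
import torch
import torch.nn as nn

batch_size, seq_len = 32, 45
embed_dim, hidden_dim = 75, 96
vocab_size = 5000  # assumed size, for illustration only

embedding = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
fc = nn.Linear(hidden_dim * 2, 2)

input_ids = torch.randint(0, vocab_size, (batch_size, seq_len))  # [32, 45]
x = embedding(input_ids)   # [32, 45, 75]
out, _ = lstm(x)           # [32, 45, 192] -- bidirectional doubles hidden_dim
last = out[:, -1, :]       # [32, 192]   -- final timestep only
logits = fc(last)          # [32, 2]
```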
- Loss and target shapes:

```
# criterion = nn.CrossEntropyLoss()
# inputs:  logits shape [batch_size, num_classes]
# targets: labels shape [batch_size] (dtype long), values in {0,1}
```
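As a minimal check of these shapes, `nn.CrossEntropyLoss` accepts raw logits and integer class labels directly (the tensors below are random, for illustration):

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
logits = torch.randn(4, 2)           # [batch_size, num_classes], raw unnormalized scores
labels = torch.tensor([0, 1, 1, 0])  # [batch_size], dtype long, values in {0, 1}
loss = criterion(logits, labels)     # scalar tensor (mean over the batch by default)
```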
- Training loop (pseudocode):

```python
for epoch in range(epochs):
    model.train()
    for input_ids, labels in train_loader:  # input_ids: [B, L], labels: [B]
        optimizer.zero_grad()
        logits = model(input_ids)           # [B, 2]
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()

    # validation (no grad)
    model.eval()
    with torch.no_grad():
        for input_ids, labels in valid_loader:
            logits = model(input_ids)
            # accumulate validation loss
```
- Inference & output CSV format:

After predictions are generated by argmax over logits, the notebook constructs a DataFrame like:

```
Id,Label
12345,INFORMATIVE
12346,UNINFORMATIVE
```

The CSV is saved as `predictions.csv` and can be submitted or inspected. The notebook maps label ints back to strings with `{0: 'UNINFORMATIVE', 1: 'INFORMATIVE'}`.
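A minimal sketch of that export, using hypothetical ids and predictions (the notebook's exact variable names may differ):

```python
import pandas as pd

# Hypothetical predicted class indices and tweet ids, for illustration only.
pred_ids = [1, 0, 1]
tweet_ids = [12345, 12346, 12347]
id2label = {0: "UNINFORMATIVE", 1: "INFORMATIVE"}

df = pd.DataFrame({"Id": tweet_ids,
                   "Label": [id2label[p] for p in pred_ids]})
csv_text = df.to_csv(index=False)  # or df.to_csv("predictions.csv", index=False)
```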
- Accuracy computation (noting limitations):

```
accuracy = 1 - (number_of_mismatches / total_rows)
```

This is a simple overall accuracy; for imbalanced datasets, prefer per-class precision/recall/F1.
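For a fuller picture on imbalanced labels, per-class metrics can be computed with `sklearn.metrics`; the label arrays below are made up for illustration:

```python
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Made-up ground-truth and predicted label ids (0 = UNINFORMATIVE, 1 = INFORMATIVE).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Per-class precision, recall, F1, and support (average=None keeps classes separate).
prec, rec, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1], average=None, zero_division=0)
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])  # rows = true, cols = predicted
```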
## Model Architecture

The classifier uses a compact BiLSTM architecture: an `Embedding` layer to convert token ids to dense vectors, an `LSTM` layer with `bidirectional=True` to capture left and right context, `Dropout` for regularization, and a final `Linear` layer mapping to two logits (binary classification).
PyTorch-style sketch:

```python
class Binary_Classifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(p=0.6)
        self.fc = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, x):
        x = self.embedding(x)
        x, _ = self.lstm(x)
        # Use the final output timestep from the LSTM sequence output.
        # With a bidirectional LSTM, the hidden size is doubled, so
        # `x[:, -1, :]` concatenates the last forward and backward outputs.
        x = self.dropout(x[:, -1, :])
        logits = self.fc(x)
        return logits
```
Notes on dimensions and design choices (from the notebook):
- `embed_dim = 75`, `hidden_dim = 96` (so the linear layer receives `hidden_dim * 2 = 192` features).
- `nn.LSTM(..., batch_first=True, bidirectional=True)` returns `x` shaped `[batch_size, seq_len, hidden_dim*2]`.
- Dropout probability in the notebook is relatively high (`p=0.6`) to reduce overfitting on a small dataset.
## Training

- Loss: `CrossEntropyLoss`.
- Optimizer: `Adam` with a small weight decay.
- Batch size and `max_len` tuned to memory limits (example uses `batch_size=32`, `max_len=45`).
- The training loop tracks training and validation loss per epoch and returns loss curves for plotting.
Training specifics pulled from the notebook code:

- Optimizer: `torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)`; the weight decay acts as L2 regularization.
- Loss: `nn.CrossEntropyLoss()`; the model outputs raw logits, and the loss combines `LogSoftmax` and `NLLLoss` internally.
- Epochs: the notebook example runs `epochs = 10` and records `train_losses` and `val_losses` lists for visualization.
- Training loop detail: gradient zeroing with `optimizer.zero_grad()`, forward pass, `loss.backward()`, `optimizer.step()`; the validation pass runs inside `torch.no_grad()`.
Example training call from the notebook:

```python
model = Binary_Classifier(len(vocab), embed_dim=75, hidden_dim=96)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
train_losses, val_losses = train_model(model, train_loader, valid_loader, epochs=10, criterion=criterion, optimizer=optimizer)
```
## Evaluation & Outputs

- Predictions are produced by taking `argmax` over the model logits.
- The notebook maps numeric predictions back to labels (`0: UNINFORMATIVE`, `1: INFORMATIVE`) and compares them to ground truth to compute accuracy.
- Loss curves are plotted for visual inspection of training/validation behavior.
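The inference path can be sketched as a small helper; `model` and the loader are assumed to follow the notebook's conventions (batches of `(input_ids, labels)`):

```python
import torch

def predict(model, loader):
    # Sketch of batched inference; collects argmax class indices over all batches.
    model.eval()                  # disable dropout
    preds = []
    with torch.no_grad():         # no gradient tracking during inference
        for input_ids, _ in loader:
            logits = model(input_ids)                      # [B, 2]
            preds.extend(torch.argmax(logits, dim=1).tolist())
    return preds
```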
Additional evaluation details from the notebook:

- During inference the model runs in `model.eval()` mode and batches are processed without gradients. Predicted class indices are `torch.argmax(outputs, dim=1)` and are collected across the test set.
- The notebook constructs a result DataFrame `result_test_df`, maps predictions back to string labels, and computes a simple accuracy metric as:
```python
different_rows = len(result_test_df[result_test_df["Label"] != result_test_df["predicted Label"]])
total_rows = len(result_test_df)
accuracy = 1 - different_rows / total_rows
```
- Predictions are exported to CSV using the notebook's path in Drive: `/content/drive/MyDrive/ECE 364 Final Project/predictions.csv`.
## Notes & Next Steps

- This implementation is intentionally compact for experimentation. Possible improvements:
  - Use pretrained embeddings (GloVe / fastText) or Transformers for better performance.
  - Add class weighting or focal loss if labels are imbalanced.
  - Evaluate with precision/recall/F1 and confusion matrices.
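For the class-weighting suggestion, one minimal sketch is to pass inverse-frequency weights to the loss; the counts below are invented, not taken from the dataset:

```python
import torch
import torch.nn as nn

# Invented per-class example counts; replace with real counts from the training split.
class_counts = torch.tensor([700.0, 300.0])        # [UNINFORMATIVE, INFORMATIVE]
weights = class_counts.sum() / (2 * class_counts)  # inverse-frequency weighting
criterion = nn.CrossEntropyLoss(weight=weights)    # minority-class errors cost more
```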
Practical next steps and small engineering improvements consistent with the repository:

- Save and load model weights with `torch.save(model.state_dict(), 'model.pth')` and `model.load_state_dict(torch.load('model.pth'))` for reproducible inference.
- Add precision/recall/F1 reporting using `sklearn.metrics` rather than only accuracy, to surface class-specific performance.
- Replace the simple tokenizer/vocab with a subword tokenizer (SentencePiece) or use a Transformer encoder (e.g., `distilbert`) for better generalization on noisy social media text.
- Add a small `requirements.txt` or `environment.yml` to pin the package versions used during evaluation.