# BiLSTM COVID-19 Tweet Classifier

## Overview
This project implements a BiLSTM-based binary classifier to identify whether short social media posts about COVID-19 are informative or uninformative. The model demonstrates end-to-end text processing: tokenization, vocabulary building, sequence padding, an embedding layer, bidirectional LSTM encoding, and a final classifier head.
## Dataset & Preprocessing

- Source: short tweet-style texts labeled as `INFORMATIVE` or `UNINFORMATIVE`.
- Tokenization: simple regex-based tokenizer that lowercases and extracts word tokens.
- Vocabulary: built from training texts with a minimum frequency threshold, including special tokens `<pad>` and `<unk>`.
- Sequences: converted to integer token ids and padded/truncated to a fixed `max_len`.
The implementation in the project notebook uses the WNUT-style tab-separated files mounted from Google Drive (see ECE_364_Final_project.ipynb). Key preprocessing details from the notebook:

- File loading: `pd.read_csv(..., sep='\t', names=["Id","Text","Label"])`, with an initial clean where the first CSV row is dropped for the training file.
- Label mapping: `{"UNINFORMATIVE": 0, "INFORMATIVE": 1}` is applied across the train/valid/test splits.
- Tokenizer: `re.findall(r"\w+", text.lower())`, which preserves alphanumeric tokens and removes punctuation.
- Vocabulary building: counts words over the training split and adds tokens with frequency >= `min_freq` (the notebook uses `min_freq=2`). The vocab starts with `{"<pad>": 0, "<unk>": 1}`.
- Padding: sequences are padded with `0` (`<pad>`) to a uniform `max_len` (the notebook uses `max_len = 45`). Longer sequences are truncated from the end.
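The tokenization, vocabulary, and padding steps above can be sketched as a few small helpers. The function names here (`tokenize`, `build_vocab`, `encode`) are illustrative and may differ from the notebook's:

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep alphanumeric word tokens, dropping punctuation.
    return re.findall(r"\w+", text.lower())

def build_vocab(texts, min_freq=2):
    # Count tokens over the training split; keep those at or above min_freq.
    counts = Counter(tok for t in texts for tok in tokenize(t))
    vocab = {"<pad>": 0, "<unk>": 1}
    for tok, freq in counts.items():
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab

def encode(text, vocab, max_len=45):
    # Map tokens to ids (unknown tokens -> <unk>), then pad/truncate to max_len.
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokenize(text)]
    ids = ids[:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))
```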
## Technical Walkthrough
The following walkthrough explains the exact data flow, tensor shapes, and algorithmic steps so a reader can understand the model without opening the notebook or code files.
- Tokenization example:

```
Input: "COVID updates: 10 new cases!"
Tokenizer (regex \w+): ['covid', 'updates', '10', 'new', 'cases']
```
- Vocabulary example (built from training data, indices shown):

```
{'<pad>': 0, '<unk>': 1, 'covid': 2, 'cases': 3, 'vaccine': 4, 'hospital': 5, ...}
```
- Text -> sequence -> padded sequence example:

```
Text tokens: ['covid', 'updates', '10', 'new', 'cases']
Sequence (ids): [2, 17, 42, 11, 3]
Padded to max_len=8: [2, 17, 42, 11, 3, 0, 0, 0]
```
- Tensor shapes (per batch):

```
input_ids shape:       [batch_size, seq_len]               # e.g., [32, 45]
after embedding:       [batch_size, seq_len, embed_dim]    # e.g., [32, 45, 75]
LSTM output:           [batch_size, seq_len, hidden_dim*2] # e.g., [32, 45, 192]
select last timestep:  [batch_size, hidden_dim*2]          # e.g., [32, 192]
final logits:          [batch_size, num_classes]           # e.g., [32, 2]
```
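The shape progression above can be verified with a few standalone layers; the vocabulary size of 5000 is assumed here for illustration:

```python
import torch
import torch.nn as nn

batch_size, seq_len = 32, 45
embed_dim, hidden_dim = 75, 96
vocab_size = 5000  # assumed size, for illustration only

embedding = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
fc = nn.Linear(hidden_dim * 2, 2)

input_ids = torch.randint(0, vocab_size, (batch_size, seq_len))  # [32, 45]
x = embedding(input_ids)   # [32, 45, 75]
out, _ = lstm(x)           # [32, 45, 192] -- bidirectional doubles hidden_dim
last = out[:, -1, :]       # [32, 192]   -- final timestep only
logits = fc(last)          # [32, 2]
```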
- Loss and target shapes:

```
# criterion = nn.CrossEntropyLoss()
# inputs:  logits shape [batch_size, num_classes]
# targets: labels shape [batch_size] (dtype long), values in {0,1}
```
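As a minimal check of these shapes, `nn.CrossEntropyLoss` accepts raw logits and integer class labels directly (the tensors below are random, for illustration):

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
logits = torch.randn(4, 2)           # [batch_size, num_classes], raw unnormalized scores
labels = torch.tensor([0, 1, 1, 0])  # [batch_size], dtype long, values in {0, 1}
loss = criterion(logits, labels)     # scalar tensor (mean over the batch by default)
```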
- Training loop (pseudocode):

```python
for epoch in range(epochs):
    model.train()
    for input_ids, labels in train_loader:  # input_ids: [B, L], labels: [B]
        optimizer.zero_grad()
        logits = model(input_ids)           # [B, 2]
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()

    # validation (no grad)
    model.eval()
    with torch.no_grad():
        for input_ids, labels in valid_loader:
            logits = model(input_ids)
            # accumulate validation loss
```
- Inference & output CSV format:

After predictions are generated by argmax over logits, the notebook constructs a DataFrame like:

```
Id,Label
12345,INFORMATIVE
12346,UNINFORMATIVE
```

The CSV is saved as `predictions.csv` and can be submitted or inspected. The notebook maps label ints back to strings with `{0: 'UNINFORMATIVE', 1: 'INFORMATIVE'}`.
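A minimal sketch of that export, using hypothetical ids and predictions (the notebook's exact variable names may differ):

```python
import pandas as pd

# Hypothetical predicted class indices and tweet ids, for illustration only.
pred_ids = [1, 0, 1]
tweet_ids = [12345, 12346, 12347]
id2label = {0: "UNINFORMATIVE", 1: "INFORMATIVE"}

df = pd.DataFrame({"Id": tweet_ids,
                   "Label": [id2label[p] for p in pred_ids]})
csv_text = df.to_csv(index=False)  # or df.to_csv("predictions.csv", index=False)
```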
- Accuracy computation (noting limitations):

```
accuracy = 1 - (number_of_mismatches / total_rows)
```

This is a simple overall accuracy; for imbalanced datasets, prefer per-class precision/recall/F1.
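For a fuller picture on imbalanced labels, per-class metrics can be computed with `sklearn.metrics`; the label arrays below are made up for illustration:

```python
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Made-up ground-truth and predicted label ids (0 = UNINFORMATIVE, 1 = INFORMATIVE).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Per-class precision, recall, F1, and support (average=None keeps classes separate).
prec, rec, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1], average=None, zero_division=0)
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])  # rows = true, cols = predicted
```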
## Model Architecture

The classifier uses a compact BiLSTM architecture: an `Embedding` layer to convert token ids to dense vectors, an `LSTM` layer with `bidirectional=True` to capture left and right context, `Dropout` for regularization, and a final `Linear` layer mapping to two logits (binary classification).
PyTorch-style sketch:

```python
class Binary_Classifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(p=0.6)
        self.fc = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, x):
        x = self.embedding(x)
        x, _ = self.lstm(x)
        # Use the final output timestep from the LSTM sequence output.
        # With a bidirectional LSTM, the hidden size is doubled, so
        # `x[:, -1, :]` concatenates the last forward and backward outputs.
        x = self.dropout(x[:, -1, :])
        logits = self.fc(x)
        return logits
```
Notes on dimensions and design choices (from the notebook):
- `embed_dim = 75`, `hidden_dim = 96` (so the linear layer receives `hidden_dim * 2 = 192` features).
- `nn.LSTM(..., batch_first=True, bidirectional=True)` returns `x` shaped `[batch_size, seq_len, hidden_dim*2]`.
- Dropout probability in the notebook is relatively high (`p=0.6`) to reduce overfitting on a small dataset.
## Training

- Loss: `CrossEntropyLoss`.
- Optimizer: `Adam` with a small weight decay.
- Batch size and `max_len` tuned to memory limits (example uses `batch_size=32`, `max_len=45`).
- The training loop tracks training and validation loss per epoch and returns loss curves for plotting.
Training specifics pulled from the notebook code:

- Optimizer: `torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)`; the weight decay acts as L2 regularization.
- Loss: `nn.CrossEntropyLoss()`; the model outputs raw logits, and the loss combines `LogSoftmax` and `NLLLoss` internally.
- Epochs: the notebook example runs `epochs = 10` and records `train_losses` and `val_losses` lists for visualization.
- Training loop detail: gradient zeroing with `optimizer.zero_grad()`, forward pass, `loss.backward()`, `optimizer.step()`; the validation pass runs inside `torch.no_grad()`.
Example training call from the notebook:

```python
model = Binary_Classifier(len(vocab), embed_dim=75, hidden_dim=96)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
train_losses, val_losses = train_model(model, train_loader, valid_loader, epochs=10, criterion=criterion, optimizer=optimizer)
```
## Evaluation & Outputs

- Predictions are produced by taking `argmax` over the model logits.
- The notebook maps numeric predictions back to labels (`0: UNINFORMATIVE`, `1: INFORMATIVE`) and compares them to ground truth to compute accuracy.
- Loss curves are plotted for visual inspection of training/validation behavior.
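The inference path can be sketched as a small helper; `model` and the loader are assumed to follow the notebook's conventions (batches of `(input_ids, labels)`):

```python
import torch

def predict(model, loader):
    # Sketch of batched inference; collects argmax class indices over all batches.
    model.eval()                  # disable dropout
    preds = []
    with torch.no_grad():         # no gradient tracking during inference
        for input_ids, _ in loader:
            logits = model(input_ids)                      # [B, 2]
            preds.extend(torch.argmax(logits, dim=1).tolist())
    return preds
```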
Additional evaluation details from the notebook:

- During inference the model runs in `model.eval()` mode and batches are processed without gradients. Predicted class indices are `torch.argmax(outputs, dim=1)` and are collected across the test set.
- The notebook constructs a result DataFrame `result_test_df`, maps predictions back to string labels, and computes a simple accuracy metric as:
```python
different_rows = len(result_test_df[result_test_df["Label"] != result_test_df["predicted Label"]])
total_rows = len(result_test_df)
accuracy = 1 - different_rows / total_rows
```
- Predictions are exported to CSV using the notebook's path in Drive: `/content/drive/MyDrive/ECE 364 Final Project/predictions.csv`.
## Notes & Next Steps

- This implementation is intentionally compact for experimentation. Possible improvements:
  - Use pretrained embeddings (GloVe / fastText) or Transformers for better performance.
  - Add class weighting or focal loss if labels are imbalanced.
  - Evaluate with precision/recall/F1 and confusion matrices.
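For the class-weighting suggestion, one minimal sketch is to pass inverse-frequency weights to the loss; the counts below are invented, not taken from the dataset:

```python
import torch
import torch.nn as nn

# Invented per-class example counts; replace with real counts from the training split.
class_counts = torch.tensor([700.0, 300.0])        # [UNINFORMATIVE, INFORMATIVE]
weights = class_counts.sum() / (2 * class_counts)  # inverse-frequency weighting
criterion = nn.CrossEntropyLoss(weight=weights)    # minority-class errors cost more
```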
Practical next steps and small engineering improvements consistent with the repository:

- Save and load model weights with `torch.save(model.state_dict(), 'model.pth')` and `model.load_state_dict(torch.load('model.pth'))` for reproducible inference.
- Add precision/recall/F1 reporting using `sklearn.metrics` rather than only accuracy, to surface class-specific performance.
- Replace the simple tokenizer/vocab with a subword tokenizer (SentencePiece) or use a Transformer encoder (e.g., `distilbert`) for better generalization on noisy social media text.
- Add a small `requirements.txt` or `environment.yml` to pin the package versions used during evaluation.