Work & Research Experience

Developed geospatial data pipelines, ML models, and a wildfire simulation framework to predict future fire risk and assess economic impact under future climate scenarios.

Data Scientist Intern – Global Risk Analytics Team

Royal Bank of Canada

  • Data Collection and Dataset Curation: Sourced, cleaned, and analyzed global geospatial data, including fire intensity, Fire Weather Index (FWI), elevation, and land cover from satellite and reanalysis sources; merged and aligned the data, then grouped it into fire events using DBSCAN and time-based clustering for ML model training.

  • ML Model Development: Trained kernelized logistic regression and XGBoost models to predict future FWI under various climate scenarios, optimized with Optuna and geospatial cross-validation.

  • Wildfire Simulation: Built a wildfire season simulation framework by modifying the source code of the CLIMADA library and incorporating ML, probabilistic models, and a cellular automaton to assess future fire risk, supporting enhanced economic impact assessments for wildfire management.

Toronto, 05/2024-08/2024

Data Scientist Intern – Fixed Income Team

Guardian Capital

Toronto, 01/2024-05/2024

Developed ML and deep learning pipelines for predicting Non-Farm Payroll (NFP) and US Treasury Yield, focusing on enhancing predictive accuracy and efficiency.

  • Constructed ML Pipeline: Designed and implemented an end-to-end ML pipeline incorporating models such as Kernelized Linear Regression, Histogram-based Gradient Boosting Regressor, and Random Forest Regressor, meeting an urgent and competing deadline while maintaining high accuracy and quality.

  • Implemented Advanced Data Processing: Integrated sophisticated data processing methods, utilizing roll-forward partitioning and lagged features to clean and prepare data for optimal model performance.

  • Optimized Model Performance: Conducted hyperparameter optimization using search methods such as halving random search with cross-validation to fine-tune the models, yielding a significant improvement in predictive accuracy.

  • Expanded with Deep Learning: Utilized a pre-trained transformer model (TimeGPT-1) to predict NFP time series data, aiming to leverage the latest advancements in artificial intelligence for financial time series forecasting.

NLP (Retrieval Augmented Generation) Research Assistant

Toronto and Region Conservation Authority (TRCA)

Developed a Generative Question Answering (QA) System for TRCA's Technical Documents leveraging RAG and LLMs.

  • Engineered a QA System: Designed and implemented a QA system using retrieval-augmented generation, combining large language models with an advanced information retrieval system.

  • Team Management: Managed a team of graduate and undergraduate assistants, overseeing their daily research processes and ensuring the operation and completion of projects.

  • Optimized Data Processing: Conducted data cleaning and preprocessing for TRCA's technical documents, ensuring a clean, accessible corpus for the QA system.

  • Innovated in Prompt Engineering: Developed and optimized passage retrieval mechanisms and domain-specific prompt engineering to guide LLMs in generating accurate, context-aware responses.

  • Developing User-Friendly GUI: Spearheaded the creation of an interactive GUI for the QA system to enhance user engagement, accessibility, and personalized interaction through a chatbot interface.

Toronto, 07/2023-Present

NLP (LLM pre-training and fine-tuning) Research Assistant

University of Toronto

Developed the first corpus in the construction management domain and pre-trained domain-specific large language models.

  • Corpus Development: Successfully collected, cleaned, and pre-processed data from 732 journal papers to develop a comprehensive domain-specific corpus for construction management.

  • Model Training: Engineered an end-to-end pipeline for the pre-training and fine-tuning of domain-specific pre-trained models (PTMs), optimizing with minimal preprocessing and hyperparameter adjustments.

  • Performance Improvement in NLP Tasks: Achieved significant improvements in Text Classification (TC) and Named Entity Recognition (NER) tasks within the construction management systems (CMS) domain, increasing the F1 score by 5.9% and 8.5% to 75.3% and 95.4%, respectively.

  • Domain-Specific Model Advancements: Contributed to the advancement of NLP applications in the CMS domain by obtaining domain-specific PTMs that demonstrate enhanced performance on specialized tasks.

Toronto, 05/2022-06/2023

Machine Learning Research Assistant

University of Toronto

Developed an automatic shoreline recognition system from aerial images using advanced ML algorithms.

  • Labeled and pre-processed shoreline images for use in ML algorithms.

  • Implemented Random Forest, XGBoost, and LGBM algorithms for shoreline detection, achieving average semantic segmentation accuracies of 95.6%, 96.0%, and 94.8%, respectively.

  • Enhanced final shoreline accuracy by post-processing and applying a Gaussian Edge Detection algorithm to the results from the ML algorithms.

Toronto, 05/2021-03/2022

Teaching Assistant

University of Toronto

Supported 15+ courses in NLP, Data Science, and AI across three departments.
Led tutorials, delivered lectures, designed assignments, and mentored students on technical projects.