Cainã Max Couto da Silva

Postdoctoral Data Scientist

UW-Madison

About me

As a highly skilled data scientist with a PhD in bioinformatics and over ten years of experience on relevant projects, I have built a strong foundation in data science and analytics. I have spent the last few years at world-renowned companies, developing end-to-end machine learning applications. Additionally, driven by my passion for knowledge, I’ve taught specialized courses on various data science topics.

Interests

Data Science
Artificial Intelligence
Generative AI (LLMs)
Machine Learning Engineering

Education

MBA in Data Science & Analytics
Universidade de São Paulo
PhD in Bioinformatics
Universidade de São Paulo

Current experience

Postdoctoral Data Scientist

UW-Madison - Top 15 public university in the US

August 2024 – Present Madison - WI, USA

Currently developing a RAG-based text-to-SQL AI agent that allows users to interact with the database using Slack.
Developed SQLDeps: an open-source Python package leveraging LLMs to automatically extract table and column dependencies and outputs from complex SQL scripts 100X faster and >300X cheaper than human expert labor.
Developed and optimized cattle mapping using cutting-edge deep learning models on high‑resolution satellite imagery, providing actionable intelligence to combat Amazon deforestation.
Engineered a pioneering machine learning classifier to assess data quality from farm properties at scale, implementing feature engineering on geospatial and entity data that enabled scalable data integrity checks for the first time.
Lead UW‑Madison students in research projects, mentoring on machine learning and computer vision techniques.
Automated database workflows, reducing manual intervention and boosting team productivity.
Delivered private data analysis reports for stakeholders in Brazil and the USA.

Tools: Python, SQL, GitHub/GitLab, git, AWS, LangChain, PyTorch, YOLO, LLMs, Streamlit, machine learning libraries (e.g., scikit-learn), data visualization libraries, etc.

Industry experience

Data Scientist

Schlumberger - World’s largest offshore drilling company

January 2023 – July 2024 Houston - TX, USA (remote)

I developed end-to-end AI SaaS products for our internal customers.

Main activities:

Building predictive models using either statistical or machine learning approaches to assess the health of the company assets (tools)
Assessing the technical feasibility of new projects through data analysis

Deliveries:

Statistical models to assess asset health
Machine learning models to predict asset health for upcoming usage
A custom-trained OCR model to extract dimensions of engineering drawings

Quick facts:

As a result of my first project, our work was accepted for publication as a scientific paper on OnePetro.
I led an innovation proposal using AI, and we ranked in the top 12 out of almost 400 ideas worldwide. My colleague and I gave the final pitch to the CEOs.
I worked in one of the most diverse teams in the company, interacting daily with people from the US, India, Europe, and South America.

Tools: Dataiku, GCP, SQL, Python, Dash, Streamlit, machine learning libraries (e.g., scikit-learn), data visualization libraries.

Data Science Consultant (Education)

DNC - Edtech

October 2021 – July 2024 São Paulo - SP, Brazil (remote)

I have worked in multiple roles: facilitator, mentor, and consultant/instructor.

As a consultant/instructor, I prepared the course modules and recorded classes. I recorded four modules: statistics, data cleaning/wrangling, clustering, and model deployment. I also devised data science activities on descriptive and inferential statistics, unsupervised machine learning models, MLflow, and big data with PySpark, among others.

As a mentor, I assessed student reports and addressed their questions through Q&A sessions, aiding in academic and real-world projects.

As a facilitator, I participated in data science activities from exploratory analysis to model deployment. I also prepared some of those activities.

Data Scientist

Ambev Tech - World’s largest brewing company

May 2022 – December 2022 São Paulo - SP, Brazil (remote)

Squad: Revenue Management

Activities / Deliverables:

Rule-based automation of pipelines for price engines using PySpark (big data)
Statistical model to identify price changepoints for several SKU categories
Monthly time series modeling of sales volume using state-of-the-art forecasting and hierarchical reconciliation methods
Training for data scientists

Tools: Python, PySpark, MLflow, scikit-learn, PyCaret, Statsmodels, forecasting frameworks (Prophet, NeuralProphet, StatsForecast)

Data Scientist

Remessa Online - Fintech

October 2021 – April 2022 São Paulo - SP, Brazil (remote)

Activities:

Advanced statistical modeling: exploratory data analysis, hypothesis testing, time series forecasting, predictive analyses (e.g., regression and classification), customer and product segmentation, etc.

Deliverables:

Multiple time series forecasting for customer segments (weekly and monthly)
Model for predicting the probability of customer recurrence
In-depth study of churn analysis through inferential statistics

Tools: Python, PySpark, MLflow, scikit-learn, PyCaret, CatBoost, Prophet, AWS, GitHub, data visualization libraries (Plotly, Seaborn, Matplotlib)

Safety Data Sciences Associate

Eli Lilly

June 2021 – October 2021 Indianapolis - IN, USA (remote)

Activities:

Queries and reports for worldwide company members

Academic experience

MBA in Data Science & Analytics

Universidade de São Paulo

May 2021 – August 2023 São Paulo - SP, Brazil

Grade: 10

In‑depth study of machine learning models.
Developed an end‑to‑end hybrid ML model for churn prediction.
Project code repository

PhD Researcher

Universidade de São Paulo

July 2016 – April 2021 São Paulo - SP, Brazil

Thesis: Identifying natural selection in Native American populations. Supported by: CAPES (2016 - 2018) and FAPESP (2018 - 2020)

Activities:

Data analysis, visualization, and scientific reporting of genetic data using R, Python, and bash scripting.
Application of unsupervised algorithms (e.g., PCA), descriptive and inferential statistics.

Deliverables:

Internal R package with customized functions to facilitate multiple analyses
Three scientific papers published in international journals
Thesis code repository

I presented our preliminary work at international conferences, including in the USA. I also completed a three-month internship in Barcelona, Spain.

Master of Science

Universidade de São Paulo

April 2014 – March 2016 São Paulo - SP, Brazil

Dissertation: Role of cellular prion protein and its ligand, STIP1, in adult neurogenesis. Supported by: CNPq (2014 - 2016)

Main techniques: primary cell culture, immunofluorescence, and hypothesis testing.

Bachelor's Degree in Biological Sciences

Universidade de São Paulo

February 2011 – December 2013 São Paulo - SP, Brazil

Monograph: Role of the interaction between the cellular prion protein and its ligand, STI1, in the biology of neural precursors from the murine adult brain. Supported by: University scholarship (2011 - 2012) and FAPESP (2012 - 2013)

Honors & Awards:

Best academic performance
Professor Bertha Lange de Morretes’ Award

Publications

Mitigating Nonproductive Time: A Novel Algorithm for DSL Fault Detection

We built a data-driven framework for cable fault detection. We validated our approach using 60 random labeled files, achieving 98% for both accuracy and F1-score.

Cainã Max Couto da Silva, Shruthi Shetty, Alejandro Olid-Gonzalez, Gery Wallez, Vincent Chatelet, Abhinav Kohar

Indigenous people from Amazon show genetic signatures of pathogen-driven selection

This study analyzed the genomes of native Amazon populations, finding evidence of genetic adaptation to combat Chagas disease, a prevalent tropical illness in the region.

Cainã Max Couto da Silva, Kelly Nunes, Gabriela Venturini, Marcos A. Castro-e-Silva, Lygia V. Pereira, David Comas, Tábita Hünemeier

Population Histories and Genomic Diversity of South American Natives

Study on 58 native South American populations reveals genetic clusters.

Marcos A. Castro-e-Silva, Tiago Ferraz, Cainã Max Couto da Silva, Renan B. Lemes, Kelly Nunes, David Comas, Tábita Hünemeier

Selection scan reveals three new loci related to high altitude adaptation in Native Andeans

This study analyzed genomic data from Native Americans in the Andean highlands and lowland areas, identifying genes for high-altitude adaptation in Andeans related to the hypoxia response.

Vanessa C. Jacovas, Cainã Max Couto da Silva, Kelly Nunes, Renan B. Lemes, Marcelo Z. de Oliveira, Francisco M. Salzano, Maria Cátira Bortolini, Tábita Hünemeier