Rada Menczel

The Secret Sauce of Data Preprocessing in Machine Learning

SentinelOne

Bio

Rada Menczel is head of machine learning at SentinelOne, a cybersecurity startup. In her role, she leads a team of data scientists and is responsible for all aspects of the ML models in the company. Rada has vast experience in researching algorithms and models both in cyber and fintech domains. Rada has an MSc in Information Systems Engineering from Ben-Gurion University, specializing in machine learning and recommender systems. She is enthusiastic about data science, machine learning, deep learning, and anything at all related to learning.

Abstract

When data scientists wish to train new models, they have a general idea of what their flow will look like. Assuming that the problem they need to solve is well defined, they need to explore the data, define labels, visualize, train, evaluate, tune, and test. The most time-consuming and often tedious part is data preprocessing and preparation. Should you not fully invest in this stage, you may still get a decent model – but is that good enough? What if I told you that by adding a small step, you can improve your model results and achieve greatness?

In this talk, I will present a problem that is often ignored – identical feature vectors with different labels. We will discuss why this happens and how you can solve it in different ways across domains. By the end of this discussion, you will wonder how you ever preprocessed without this phase.
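As a taste of the problem the talk addresses, the sketch below shows one simple way to detect identical feature vectors that carry conflicting labels. This is an illustrative example only, not the speaker's method: it assumes a pandas DataFrame with feature columns and a `label` column, and flags any feature vector that appears with more than one distinct label.

```python
import pandas as pd

# Toy dataset (hypothetical): rows 0 and 1 share the exact same feature
# vector (f1=0, f2=5) but disagree on the label.
df = pd.DataFrame({
    "f1":    [0, 0, 1, 1],
    "f2":    [5, 5, 3, 3],
    "label": [1, 0, 1, 1],
})

feature_cols = ["f1", "f2"]

# Group by the full feature vector and count distinct labels per group;
# any group with more than one distinct label is a labeling conflict.
label_counts = df.groupby(feature_cols)["label"].nunique()
conflicting = label_counts[label_counts > 1]

print(conflicting)  # feature vectors with inconsistent labels
```

Once such conflicts are found, possible resolutions include majority-vote relabeling, dropping the conflicting rows, or adding features that disambiguate them – the trade-offs are domain-dependent, which is part of what the talk explores.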
