Data Preparation and Model Evaluation

This page provides a unified set of best practices for data preparation and model evaluation applicable across various machine learning tasks, including regression, classification, time series forecasting, and anomaly detection.

Data Preparation

High-quality data is the foundation of any effective machine learning model. The following steps are crucial for preparing your data for analysis. Leveraging in-database tools like SAP HANA's Predictive Analysis Library (PAL) can significantly improve efficiency by minimizing data movement.

Data Quality and Cleaning

  • Accuracy and Completeness: Ensure that your data, especially independent variables, is accurate, complete, and free from errors.
  • Handling Null Values: SAP HANA PAL functions may abort when their input contains NULL values, so decide how to handle NULLs before calling them. You can replace NULLs with default values in SQL or impute them. The hana_ml.algorithms.pal.preprocessing.Imputer class covers mean, median, most-frequent, and ALS-based (alternating least squares) imputation, as well as deleting rows that contain NULLs; check the class documentation for the exact strategy names.
  • Outlier Management: Outliers can distort analysis and model accuracy. Identify and handle them using techniques like capping, smoothing, or replacing them with interpolated values.
  • Sufficient Data: Ensure you have enough data points for the task. The more features you use and the noisier your data, the more observations you need to fit a reliable model.
  • Validation: Validate your imputation strategies and overall data quality with techniques like cross-validation and residual analysis, for example by comparing model performance across different imputation choices.
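
The null-handling and outlier steps above can be sketched outside the database as follows. This is a minimal illustration in plain Python, not the PAL or hana_ml implementation: the helper names impute_median and cap_outliers, the sample readings, and the choice of Tukey fences (1.5 times the interquartile range) as the capping rule are all assumptions made for the example.

```python
from statistics import median, quantiles


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed)
    return [fill if v is None else v for v in values]


def cap_outliers(values, k=1.5):
    """Clip values to the range [Q1 - k*IQR, Q3 + k*IQR] (Tukey fences)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]


# Hypothetical readings with one missing value (None) and one extreme value.
readings = [10.0, 12.0, None, 11.0, 13.0, 95.0, 12.0]

filled = impute_median(readings)   # the None becomes the median, 12.0
cleaned = cap_outliers(filled)     # 95.0 is capped to the upper fence, 14.0

print(filled)
print(cleaned)
```

Imputing before capping means the fences are computed on a complete column. The median is used as the fill value here because it is less sensitive to the extreme reading than the mean would be.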