Duplicates Cleaner

In genetic algorithms, one way to maintain diversity is to eliminate duplicates generation after generation. This operation can be computationally expensive but is often essential to ensure that the algorithm continues to explore new individuals that can improve the objectives.

Info

Benchmarks comparing pymoors/moors with other Python/Rust frameworks will be published soon. These benchmarks will highlight the importance and performance impact of duplicate elimination in genetic algorithms.

A duplicates cleaner in moors is any type that implements the PopulationCleaner trait. For example:

use std::collections::HashSet;

use ndarray::Array2;
use ordered_float::OrderedFloat;

use crate::duplicates::PopulationCleaner;

/// Exact duplicates cleaner based on Hash
#[derive(Debug, Clone)]
pub struct ExactDuplicatesCleaner;

impl ExactDuplicatesCleaner {
    pub fn new() -> Self {
        Self
    }
}

impl PopulationCleaner for ExactDuplicatesCleaner {
    fn remove(&self, population: Array2<f64>, reference: Option<&Array2<f64>>) -> Array2<f64> {
        let ncols = population.ncols();
        let mut unique_rows: Vec<Vec<f64>> = Vec::new();
        // A HashSet to hold the hashable representation of rows.
        let mut seen: HashSet<Vec<OrderedFloat<f64>>> = HashSet::new();

        // If a reference is provided, first add its rows into the set.
        if let Some(ref_pop) = reference {
            for row in ref_pop.outer_iter() {
                let hash_row: Vec<OrderedFloat<f64>> =
                    row.iter().map(|&x| OrderedFloat(x)).collect();
                seen.insert(hash_row);
            }
        }

        // Iterate over the population rows.
        for row in population.outer_iter() {
            let hash_row: Vec<OrderedFloat<f64>> = row.iter().map(|&x| OrderedFloat(x)).collect();
            // Insert returns true if the row was not in the set.
            if seen.insert(hash_row) {
                unique_rows.push(row.to_vec());
            }
        }

        // Flatten the unique rows into a single vector.
        let data_flat: Vec<f64> = unique_rows.into_iter().flatten().collect();
        Array2::<f64>::from_shape_vec((data_flat.len() / ncols, ncols), data_flat)
            .expect("Failed to create deduplicated Array2")
    }
}

The main method to implement is remove, which takes two arguments: the population and an optional reference. If a reference is provided, duplicates are determined by comparing each row of the population against all rows of the reference as well. The latter is very important: newly created offspring may be unique among themselves, but when compared against the current population (passed here as reference) they may not be.
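
As a minimal sketch of this behavior (variable names are illustrative), an offspring that is unique within the offspring matrix but already present in the current population is removed when the population is passed as reference:

use ndarray::array;

use moors::ExactDuplicatesCleaner;

// The current population already contains [1.0, 2.0].
let current_population = array![[1.0, 2.0], [3.0, 4.0]];
// The offspring are unique among themselves...
let offspring = array![[1.0, 2.0], [5.0, 6.0]];

let cleaner = ExactDuplicatesCleaner::new();
// ...but passing the current population as `reference` also removes
// any offspring that duplicates an existing individual.
let cleaned = cleaner.remove(offspring, Some(&current_population));
assert_eq!(cleaned, array![[5.0, 6.0]]);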

Currently pymoors doesn't support user-defined duplicates cleaners. This feature will be available once this issue is resolved.

Exact Duplicates Cleaner

Based on exact elimination, meaning that two individuals (genes1 and genes2) are considered duplicates if and only if each element in genes1 is equal to the corresponding element in genes2. Internally, this cleaner uses Rust’s HashSet.
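
Note that f64 implements neither Hash nor Eq (NaN breaks total equality), which is why the implementation shown above wraps every value in OrderedFloat before inserting a row into the HashSet. A minimal illustration of that trick:

use std::collections::HashSet;

use ordered_float::OrderedFloat;

// f64 alone cannot be a HashSet key, but OrderedFloat<f64> can.
let mut seen: HashSet<Vec<OrderedFloat<f64>>> = HashSet::new();
let row: Vec<OrderedFloat<f64>> = vec![1.0_f64, 2.0, 3.0]
    .into_iter()
    .map(OrderedFloat)
    .collect();
assert!(seen.insert(row.clone())); // first insertion succeeds
assert!(!seen.insert(row)); // exact duplicate is rejected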

To use the exact duplicates cleaner in moors, just pass the cleaner instance to the algorithm:

use ndarray::{array, Array2};

use moors::ExactDuplicatesCleaner;

let raw_data = vec![
    1.0, 2.0, 3.0, // row 0
    4.0, 5.0, 6.0, // row 1
    1.0, 2.0, 3.0, // row 2 (duplicate of row 0)
    7.0, 8.0, 9.0, // row 3
    4.0, 5.0, 6.0, // row 4 (duplicate of row 1)
];
let population =
    Array2::<f64>::from_shape_vec((5, 3), raw_data).expect("Failed to create test array");

let cleaner = ExactDuplicatesCleaner::new();
let cleaned = cleaner.remove(population, None);

let expected = array![
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
];

assert_eq!(cleaned, expected);

To use the exact duplicates cleaner in pymoors, just pass the cleaner object to the algorithm:

import numpy as np

from pymoors import ExactDuplicatesCleaner

cleaner = ExactDuplicatesCleaner()

raw_data = np.array([
    [1.0, 2.0, 3.0],  # row 0
    [4.0, 5.0, 6.0],  # row 1
    [1.0, 2.0, 3.0],  # row 2 (duplicate of row 0)
    [7.0, 8.0, 9.0],  # row 3
    [4.0, 5.0, 6.0],  # row 4 (duplicate of row 1)
])

cleaned = cleaner.remove(raw_data)

expected = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
])

np.testing.assert_array_equal(cleaned, expected)

Close Duplicates Cleaner

This cleaner is designed for real-valued problems where two individuals may be very similar but not exactly identical. In such cases it is beneficial to treat them as duplicates based on a proximity metric, typically the Euclidean distance. This cleaner uses the squared Euclidean distance between two individuals and considers them duplicates if this value is below a configurable tolerance (epsilon).
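
Conceptually, the duplicate criterion reduces to the following predicate (an illustrative sketch, not the actual moors implementation):

/// Two individuals are duplicates when the squared Euclidean
/// distance between them falls below `epsilon`.
/// (Illustrative helper, not the actual moors implementation.)
fn is_close_duplicate(a: &[f64], b: &[f64], epsilon: f64) -> bool {
    let squared_distance: f64 = a
        .iter()
        .zip(b)
        .map(|(x, y)| (x - y) * (x - y))
        .sum();
    squared_distance < epsilon
}

// With epsilon = 0.05, two nearly identical individuals are duplicates.
assert!(is_close_duplicate(&[1.0, 2.0, 3.0], &[1.01, 2.01, 3.01], 0.05));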

To use the close duplicates cleaner in moors, just pass the cleaner instance to the algorithm:

use ndarray::array;

use moors::CloseDuplicatesCleaner;

let population = array![[1.0, 2.0, 3.0], [10.0, 10.0, 10.0]];
let reference = array![[1.01, 2.01, 3.01]];  // close to row 0 of population
let epsilon = 0.05;
let cleaner = CloseDuplicatesCleaner::new(epsilon);
let cleaned = cleaner.remove(population, Some(&reference));
// Row 0 should be removed.
assert_eq!(cleaned.nrows(), 1);
assert_eq!(cleaned.row(0).to_vec(), vec![10.0, 10.0, 10.0]);

To use the close duplicates cleaner in pymoors, just pass the cleaner object to the algorithm:

import numpy as np

from pymoors import CloseDuplicatesCleaner

population = np.array([[1.0, 2.0, 3.0], [10.0, 10.0, 10.0]])
reference = np.array([[1.01, 2.01, 3.01]])  # close to row 0 of population
epsilon = 0.05

cleaner = CloseDuplicatesCleaner(epsilon=epsilon)

cleaned = cleaner.remove(population, reference)
# Row 0 should be removed.
assert len(cleaned) == 1
np.testing.assert_array_equal(cleaned, np.array([[10.0, 10.0, 10.0]]))

Caution

This duplicate elimination algorithm can be computationally expensive when the population size and the number of offspring are large, because it requires computing the distance matrix among the offspring first, and then between the offspring and the current population. The algorithm has a complexity of O(n*m), where n is the population size and m is the number of offspring; for example, with a population of 1,000 and 500 offspring, a single generation requires on the order of 500,000 pairwise distance evaluations.