Duplicates Cleaner
In genetic algorithms, one way to maintain diversity is to eliminate duplicates generation after generation. This operation can be computationally expensive but is often essential to ensure that the algorithm continues to explore new individuals that can improve the objectives.
Info
Benchmarks comparing pymoors/moors with other Python/Rust frameworks will be published soon. These benchmarks will highlight the importance and performance impact of duplicate elimination in genetic algorithms.
A duplicates cleaner in moors is any type that implements the PopulationCleaner trait. For example:
use std::collections::HashSet;

use ndarray::Array2;
use ordered_float::OrderedFloat;

use crate::duplicates::PopulationCleaner;

/// Exact duplicates cleaner based on Hash
#[derive(Debug, Clone)]
pub struct ExactDuplicatesCleaner;

impl ExactDuplicatesCleaner {
    pub fn new() -> Self {
        Self
    }
}

impl PopulationCleaner for ExactDuplicatesCleaner {
    fn remove(&self, population: Array2<f64>, reference: Option<&Array2<f64>>) -> Array2<f64> {
        let ncols = population.ncols();
        let mut unique_rows: Vec<Vec<f64>> = Vec::new();
        // A HashSet to hold the hashable representation of rows.
        let mut seen: HashSet<Vec<OrderedFloat<f64>>> = HashSet::new();

        // If a reference is provided, first add its rows into the set.
        if let Some(ref_pop) = reference {
            for row in ref_pop.outer_iter() {
                let hash_row: Vec<OrderedFloat<f64>> =
                    row.iter().map(|&x| OrderedFloat(x)).collect();
                seen.insert(hash_row);
            }
        }

        // Iterate over the population rows.
        for row in population.outer_iter() {
            let hash_row: Vec<OrderedFloat<f64>> = row.iter().map(|&x| OrderedFloat(x)).collect();
            // `insert` returns true if the row was not already in the set.
            if seen.insert(hash_row) {
                unique_rows.push(row.to_vec());
            }
        }

        // Flatten the unique rows into a single vector.
        let data_flat: Vec<f64> = unique_rows.into_iter().flatten().collect();
        Array2::<f64>::from_shape_vec((data_flat.len() / ncols, ncols), data_flat)
            .expect("Failed to create deduplicated Array2")
    }
}
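For context, the trait shape implied by this implementation is simply the following (a sketch inferred from the remove signature above, not copied from the moors source):

use ndarray::Array2;

/// Sketch of the PopulationCleaner trait, inferred from the impl above.
pub trait PopulationCleaner {
    /// Returns `population` with duplicate rows removed, optionally also
    /// treating rows already present in `reference` as duplicates.
    fn remove(&self, population: Array2<f64>, reference: Option<&Array2<f64>>) -> Array2<f64>;
}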
The main method to implement is remove, which takes two arguments: population and an optional reference. If reference is provided, duplicates are determined by comparing each row in the population against all rows in the reference as well. This matters when new offspring are created: they may be unique among themselves, yet still duplicate individuals already present in the current population (passed here as reference).
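A minimal sketch of that reference behavior, reusing the ExactDuplicatesCleaner shown above:

use ndarray::array;
use moors::ExactDuplicatesCleaner;

// Offspring rows are unique among themselves...
let offspring = array![[1.0, 2.0, 3.0], [9.0, 9.0, 9.0]];
// ...but row 0 already exists in the current population.
let current_population = array![[1.0, 2.0, 3.0]];

let cleaner = ExactDuplicatesCleaner::new();
// Passing the current population as `reference` drops offspring
// that duplicate rows already present in it.
let cleaned = cleaner.remove(offspring, Some(&current_population));

assert_eq!(cleaned.nrows(), 1);
assert_eq!(cleaned.row(0).to_vec(), vec![9.0, 9.0, 9.0]);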
Currently pymoors doesn't support user-defined duplicates cleaners. This feature will be available once this issue is resolved.
Exact Duplicates Cleaner
Based on exact elimination, meaning that two individuals (genes1 and genes2) are considered duplicates if and only if each element in genes1 is equal to the corresponding element in genes2. Internally, this cleaner uses Rust’s HashSet.
To use the exact duplicates cleaner in moors, just pass the cleaner instance to the algorithm:
use ndarray::{array, Array2};
use moors::ExactDuplicatesCleaner;

let raw_data = vec![
    1.0, 2.0, 3.0, // row 0
    4.0, 5.0, 6.0, // row 1
    1.0, 2.0, 3.0, // row 2 (duplicate of row 0)
    7.0, 8.0, 9.0, // row 3
    4.0, 5.0, 6.0, // row 4 (duplicate of row 1)
];
let population =
    Array2::<f64>::from_shape_vec((5, 3), raw_data).expect("Failed to create test array");

let cleaner = ExactDuplicatesCleaner::new();
let cleaned = cleaner.remove(population, None);

let expected = array![
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
];
assert_eq!(cleaned, expected);
To use the exact duplicates cleaner in pymoors, just pass the cleaner object to the algorithm:
import numpy as np

from pymoors import ExactDuplicatesCleaner

cleaner = ExactDuplicatesCleaner()

raw_data = np.array([
    [1.0, 2.0, 3.0],  # row 0
    [4.0, 5.0, 6.0],  # row 1
    [1.0, 2.0, 3.0],  # row 2 (duplicate of row 0)
    [7.0, 8.0, 9.0],  # row 3
    [4.0, 5.0, 6.0],  # row 4 (duplicate of row 1)
])

cleaned = cleaner.remove(raw_data)

expected = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
])
np.testing.assert_array_equal(cleaned, expected)
Close Duplicates Cleaner
This is designed for real-valued problems where two individuals may be very similar but not exactly identical. In such cases it is beneficial to consider them duplicates based on a proximity metric, typically the Euclidean distance. This cleaner uses the squared Euclidean distance between two individuals and considers them duplicates if this value is below a configurable tolerance (epsilon).
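A minimal sketch of the criterion itself, using plain ndarray and illustrative values (independent of the cleaner):

use ndarray::array;

let a = array![1.0, 2.0, 3.0];
let b = array![1.01, 2.01, 3.01];

// Squared Euclidean distance between the two individuals.
let sq_dist: f64 = (&a - &b).mapv(|d| d * d).sum();

// They count as duplicates when the squared distance is below epsilon.
let epsilon = 0.05;
assert!(sq_dist < epsilon); // 3 * 0.01^2 = 0.0003 < 0.05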
To use the close duplicates cleaner in moors, just pass the cleaner instance to the algorithm:
use ndarray::array;
use moors::CloseDuplicatesCleaner;
let population = array![[1.0, 2.0, 3.0], [10.0, 10.0, 10.0]];
let reference = array![[1.01, 2.01, 3.01]]; // close to row 0 of population
let epsilon = 0.05;
let cleaner = CloseDuplicatesCleaner::new(epsilon);
let cleaned = cleaner.remove(population, Some(&reference));
// Row 0 should be removed.
assert_eq!(cleaned.nrows(), 1);
assert_eq!(cleaned.row(0).to_vec(), vec![10.0, 10.0, 10.0]);
To use the close duplicates cleaner in pymoors, just pass the cleaner object to the algorithm:
import numpy as np

from pymoors import CloseDuplicatesCleaner

population = np.array([[1.0, 2.0, 3.0], [10.0, 10.0, 10.0]])
reference = np.array([[1.01, 2.01, 3.01]])  # close to row 0 of population

cleaner = CloseDuplicatesCleaner(epsilon=0.05)
cleaned = cleaner.remove(population, reference)

# Row 0 should be removed.
assert len(cleaned) == 1
np.testing.assert_array_equal(cleaned, np.array([[10.0, 10.0, 10.0]]))
Caution
This duplicate elimination algorithm can be computationally expensive when the population size and the number of offspring are large: it first computes the pairwise distances among the offspring, and then the distances between the offspring and the current population. That amounts to O(m² + n·m) distance evaluations, where n is the population size and m is the number of offspring; for example, with n = m = 500 this is already on the order of a few hundred thousand squared-distance computations per generation.