2.2.  Duplicate analysis

Duplicate analysis is a key tool for data quality control, especially for large data volumes and imported catalogs. It identifies duplicate candidates but does not remove them automatically; instead, its results form the basis for downstream cleansing processes.

In concrete terms, the process comprises the following steps:

  • Automatic generation of clusters, where each cluster contains parts that are similar to each other.

  • Downstream manual annotation to determine the main part and the duplicates in each cluster.

  • Export to a CSV file.
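
The steps above can be illustrated with a minimal sketch. It is not the product's actual implementation: the similarity measure (a string ratio from Python's difflib), the threshold of 0.85, the function names, and the CSV columns (cluster_id, part_id, role) are all assumptions chosen for illustration. Parts are linked when their descriptions are similar enough, linked parts are merged into clusters, and the clusters are written to CSV with an empty role column that a reviewer would fill in during manual annotation.

```python
import csv
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means identical after normalization."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def cluster_parts(parts: dict[str, str], threshold: float = 0.85) -> list[list[str]]:
    """Group part IDs that are linked, directly or transitively,
    by a description similarity >= threshold (hypothetical default)."""
    parent = {pid: pid for pid in parts}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    ids = list(parts)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if similarity(parts[a], parts[b]) >= threshold:
                parent[find(a)] = find(b)  # merge the two groups

    groups: dict[str, list[str]] = {}
    for pid in ids:
        groups.setdefault(find(pid), []).append(pid)
    # Only groups with more than one member are duplicate candidates.
    return [members for members in groups.values() if len(members) > 1]


def export_clusters(clusters: list[list[str]], out) -> None:
    """Write one row per part; 'role' stays empty for manual annotation."""
    writer = csv.writer(out)
    writer.writerow(["cluster_id", "part_id", "role"])
    for cluster_id, members in enumerate(clusters, start=1):
        for part_id in members:
            writer.writerow([cluster_id, part_id, ""])
```

A caller would pass a mapping of part IDs to descriptions to cluster_parts and hand the result, together with an open file (opened with newline=""), to export_clusters. Note that the pairwise comparison is quadratic in the number of parts, so for large catalogs a real system would likely need a blocking or indexing step first.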