1.1.1.2.1.  Duplicate analysis

Duplicate analysis is a key tool for data quality control, especially for large data volumes and imported catalogs. It identifies duplicate candidates but does not eliminate them automatically; instead, it provides the basis for downstream cleansing processes.

In concrete terms, the process consists of:

  • Automatic generation of clusters, where each cluster contains parts that are similar to each other.

  • Downstream manual annotation process to determine main parts and duplicates.

  • Export to CSV file
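How the clusters are computed internally is not described here. Purely as a conceptual illustration, the following Python sketch groups parts whose pairwise similarity reaches a minimum value. The function name, the part names, the scores, and the single-linkage grouping are all assumptions made for this example and not the product's actual algorithm.

```python
from itertools import combinations


def cluster_by_similarity(parts, similarity, min_similarity):
    """Group parts that are linked by a similarity >= min_similarity.

    Hypothetical single-linkage grouping, implemented with union-find.
    """
    parent = {p: p for p in parts}

    def find(p):
        # Follow parent links to the group representative (with path halving).
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in combinations(parts, 2):
        if similarity(a, b) >= min_similarity:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for p in parts:
        groups.setdefault(find(p), []).append(p)
    # Assumption: a part without any similar partner forms no cluster.
    return [sorted(g) for g in groups.values() if len(g) > 1]


# Made-up similarity scores for illustration only.
scores = {
    frozenset(("bolt_a", "bolt_b")): 0.97,
    frozenset(("bolt_b", "bolt_c")): 0.92,
    frozenset(("bolt_a", "bolt_c")): 0.60,
}


def lookup(a, b):
    return scores.get(frozenset((a, b)), 0.0)


clusters = cluster_by_similarity(
    ["bolt_a", "bolt_b", "bolt_c", "nut_x"], lookup, min_similarity=0.9
)
```

In this sketch, bolt_a and bolt_c end up in the same cluster although their direct similarity is below the threshold, because both are similar to bolt_b. Whether the real analysis behaves this way is not stated in this document.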

The following example provides a brief overview of how it works.

  1. Open the dashboard and select the Duplicate Analysis menu item.

  2. Click on the Create Report button.

    -> The settings dialog opens.

  3. Fill in the individual fields.

    In particular, select the search directory and the target directory.

    Determine the minimum similarity.

    Finally, click on Create report.

    -> The card for the newly created report is displayed.

  4. Open the report with one click.

    The report page is structured as follows:

    The header contains the name of the report, a filter area and an export CSV button.

    The main area is divided into a structure tree [left], the results [center] and an overview [right]. The overview is updated as you work on the individual clusters.

  5. Click on a cluster to open it.

    All parts of a cluster start as non-annotated candidates (Main = 0 and Duplicates = 0). There is no main part yet.

  6. Note that the Duplicate buttons are deactivated as long as no main part exists.

  7. Determine a main part (duplicate candidate → main part)

    A candidate becomes the Main Part by:

    • clicking the Main Part annotation button

      -> The button is filled with the blue base color.

      or

    • dragging and dropping the candidate into the main part drop zone.

      Drag the desired candidate into the drag & drop zone

      -> The candidate is now the Main Part, i.e. the button is filled in and the Main Part is displayed on the right in the Duplicates area.

      Result: The candidate is now Main Part.

  8. Assign duplicates.

    A candidate can be annotated as a duplicate by:

    • clicking the Duplicate button

      -> The button is filled with the green base color.

      or

    • dragging and dropping the candidate onto an existing main part.

    If several main parts exist, a selection list opens to select the target main part.

    In either case, the button is now completely filled in green and the duplicate is displayed on the right in the Duplicates area under the Main Part.

  9. Now proceed in the same way with all other duplicate candidates:

    • Several main parts are possible.

    • A main part does not necessarily have to have duplicates.

  10. The aim is for each cluster to be marked as completed, i.e. to contain only main parts and assigned duplicates.
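The annotation rules from steps 5 to 10 can be summarized in a small Python sketch. It is only a model of the behavior described above: the class and method names are hypothetical, and raising an error where the interface would deactivate a button or open a selection list is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical model of one cluster during annotation."""
    candidates: set                                   # parts not yet annotated
    main_parts: dict = field(default_factory=dict)    # main part -> its duplicates

    def can_mark_duplicate(self):
        # Duplicate buttons stay deactivated until a main part exists.
        return bool(self.main_parts)

    def mark_main(self, part):
        if part not in self.candidates:
            raise ValueError(f"{part} is not an open candidate")
        self.candidates.remove(part)
        self.main_parts[part] = []  # a main part may have no duplicates

    def mark_duplicate(self, part, main=None):
        if part not in self.candidates:
            raise ValueError(f"{part} is not an open candidate")
        if not self.can_mark_duplicate():
            raise ValueError("no main part exists yet")
        if main is None:
            if len(self.main_parts) > 1:
                # The interface opens a selection list in this situation.
                raise ValueError("several main parts exist; choose a target")
            main = next(iter(self.main_parts))
        if main not in self.main_parts:
            raise ValueError(f"{main} is not a main part")
        self.candidates.remove(part)
        self.main_parts[main].append(part)

    def is_completed(self):
        # Completed: only main parts and assigned duplicates remain.
        return not self.candidates


cluster = Cluster(candidates={"p1", "p2", "p3"})
cluster.mark_main("p1")
cluster.mark_duplicate("p2")   # only one main part, so the target is unambiguous
cluster.mark_main("p3")        # second main part without duplicates
```

After these calls, `cluster.is_completed()` is true, with p1 holding one duplicate and p3 holding none.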

  11. You can monitor progress at any time in the structure tree on the left.

    The colors in the tree help to quickly find open clusters and to reopen problematic clusters:

    • White = no cluster completed yet

    • Gray = At least one cluster has been completed here, but more still need to be processed.

    • Yellow = There is a ToCheck part here, which must be completed in any case.

    • Green = Everything completed

      Yellow takes precedence over green: if all clusters are completed (green) but one of them contains a "To be checked" part (yellow), the folder is marked yellow.
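The color rules can be written down as a short decision function. This Python sketch is an illustration only: the function name and the tuple layout are hypothetical, and it assumes that yellow also overrides white and gray, which the text states explicitly only for green.

```python
def folder_color(clusters):
    """Return the tree color for a folder.

    clusters: list of (completed, has_to_check) tuples, one per cluster.
    """
    # A "to be checked" part anywhere in the folder wins over everything
    # else (assumption for white and gray; documented for green).
    if any(has_to_check for _, has_to_check in clusters):
        return "yellow"
    completed = sum(1 for done, _ in clusters if done)
    if completed == 0:
        return "white"   # no cluster completed yet
    if completed == len(clusters):
        return "green"   # everything completed
    return "gray"        # some clusters completed, others still open


status = folder_color([(True, False), (True, True)])  # all done, one to check
```

Here `status` is "yellow", because the second cluster still contains a part to be checked.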

  12. You can load parts into the comparison at any time via the Comparison button.

    Operations in the comparison and in the duplicate analysis run synchronously.

    The Comparison button of the cluster itself (at the very top) and the one on the right-hand side (main part) replace all parts currently in the part comparison.

    The comparison buttons in the parts list (duplicate candidates of the cluster) add the respective part individually without deleting the previous ones.

    The basic principles for comparing duplicate parts are the same as in the standard part comparison; a few features have been added:

    • Up to 10 parts can be loaded (the standard comparison allows only 4).

    • Parts can be annotated via icons at the top.
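The difference between the replacing and the adding Comparison buttons can be modeled as follows. This Python sketch is a simplified illustration; the class and method names are hypothetical, and rejecting an eleventh part with an error is an assumption, since the actual behavior at the limit is not described here.

```python
MAX_PARTS = 10  # limit in the duplicate analysis (4 in the standard comparison)


class Comparison:
    """Hypothetical model of the part comparison list."""

    def __init__(self):
        self.parts = []

    def replace_with(self, parts):
        # Buttons on the cluster and on the main part: discard the
        # previous content and load the given parts instead.
        parts = list(parts)
        if len(parts) > MAX_PARTS:
            raise ValueError(f"at most {MAX_PARTS} parts can be loaded")
        self.parts = parts

    def add(self, part):
        # Buttons in the parts list: add one part, keep the others.
        if part in self.parts:
            return
        if len(self.parts) >= MAX_PARTS:
            raise ValueError(f"at most {MAX_PARTS} parts can be loaded")
        self.parts.append(part)


comparison = Comparison()
comparison.replace_with(["main_1", "dup_1"])
comparison.add("candidate_7")      # list now holds three parts
comparison.replace_with(["main_2"])  # previous three parts are dropped
```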

  13. By clicking the Export CSV button, you can then export either all clusters (All option) or only intermediate statuses (Current view option).
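The two export options can be pictured with a short Python sketch. The column layout of the real CSV file is not specified in this section, so the columns cluster, part, role and main_part are hypothetical, as is the interpretation of Current view as "only the clusters currently shown by the filter".

```python
import csv
import io


def export_csv(rows, visible_clusters=None):
    """Write annotation rows as CSV text.

    visible_clusters=None models the All option; passing a set of
    cluster names models the Current view option (assumption).
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["cluster", "part", "role", "main_part"])  # hypothetical columns
    for row in rows:
        if visible_clusters is None or row["cluster"] in visible_clusters:
            writer.writerow(
                [row["cluster"], row["part"], row["role"], row["main_part"]]
            )
    return out.getvalue()


# Made-up annotation data for illustration only.
rows = [
    {"cluster": "c1", "part": "p1", "role": "main", "main_part": ""},
    {"cluster": "c1", "part": "p2", "role": "duplicate", "main_part": "p1"},
    {"cluster": "c2", "part": "p9", "role": "candidate", "main_part": ""},
]

all_csv = export_csv(rows)                               # All
current_csv = export_csv(rows, visible_clusters={"c1"})  # Current view
```

A file exported in this way could then be passed to whatever downstream cleansing process consumes the annotated main parts and duplicates.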

Details can be found under Section 2.2, “Duplicate analysis” in ENTERPRISE 3Dfindit (Professional) - Administration.