LibGuides: Identifying and Removing Duplicate Articles: Deduplication

Deduplication

What is Deduplication and Why do you have to do it?

There are two forms of deduplication to consider when conducting a systematic review. The first is the removal of identical records retrieved from multiple databases. The second is the issue of multiple articles published from the same data set. If undetected, either could create bias in the conclusions of your review.

Identifying and removing duplicate records is necessary because multiple databases often index overlapping journals. Your method of deduplication may depend on the number of articles included in your review: manual deduplication is more realistic with smaller numbers, whereas larger numbers may require automatic tools. Automatic tools are not perfect, so both methods should be used for accurate deduplication. Whichever process you decide to follow, document it and report it accurately in your article.

Identifying multiple articles published from the same data set is a bit more complicated. The Cochrane Handbook offers some good suggestions for authors here: https://training.cochrane.org/handbook/current/chapter-04#section-4-6-2, and recommends that studies (not reports) serve as the reporting unit (https://training.cochrane.org/handbook/current/chapter-04#_Ref531774783). This requires careful analysis because you don't want to leave out important articles.

You need to track the number of duplicate articles you remove for either reason for inclusion in your PRISMA diagram.

You may want to export the entire list of articles from each database to a citation manager such as EndNote, Sciwheel, Zotero, or Mendeley (including both citation and abstract in your file) and remove the duplicates there. If you are using Covidence for your review, you should also add the number of duplicates identified in Covidence to the number removed in the citation manager.
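The duplicate tally for the PRISMA diagram is simple arithmetic, and a short script can help keep it consistent. The following is a minimal sketch only: the function name, the dictionary keys, and all of the numbers are hypothetical and used purely for illustration.

```python
def prisma_duplicate_counts(records_per_database,
                            removed_in_citation_manager,
                            removed_in_covidence=0):
    """Add up the record and duplicate counts reported in a PRISMA diagram."""
    identified = sum(records_per_database.values())
    duplicates_removed = removed_in_citation_manager + removed_in_covidence
    if duplicates_removed > identified:
        raise ValueError("More duplicates removed than records identified")
    return {
        "records_identified": identified,
        "duplicates_removed": duplicates_removed,
        "records_after_deduplication": identified - duplicates_removed,
    }


# Hypothetical numbers, for illustration only
counts = prisma_duplicate_counts(
    {"CINAHL": 120, "MEDLINE": 200},
    removed_in_citation_manager=45,
    removed_in_covidence=5,
)
print(counts)
```

Keeping the per-database counts separate makes it easy to check later that the merged total matches the sum of the individual searches.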

Removing Duplicates on Ebsco

THIS INFORMATION IS OUT OF DATE. I am working on updating it.

When you search a single EBSCOhost database by itself, the results are deduplicated automatically. However, when you search more than one database at once, the number of results shown on the first page includes duplicate articles (unless it says duplicates have been removed).

In this example, I will search in CINAHL and MEDLINE together.
Enter your search terms, then click on search. Then apply your limiters. Here is my example search.
Be sure results per page is set to 50 (the maximum). To do this, click "page options", then under "results per page" change it to 50 and select "Apply".
You can see in this set of results that there are 74 results listed on the first page. To see how many of them are duplicates, go to the last page of the results: scroll down, then click on the last page (in this case, page 2).
As soon as the last page loads, the top of the results says "Note: Exact Duplicates Removed from Results." and the number of results has dropped to 70. Therefore, 4 of the results in this list were duplicates.

Removing Duplicates on ProQuest

To remove duplicates from any ProQuest database, scroll down on the advanced search page and click on the Result page options link and then check the box next to Exclude duplicate documents.

Exporting Records from Ebsco & ProQuest Databases

Exporting Results from Proquest & Ebsco

Up to 20,000 articles from the search screen
- Ebsco
- Proquest
You can also export from folders or individual records.
- Ebsco
- Proquest

For Zotero, Mendeley, & EndNote be sure you download the RIS format.
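If you want to inspect an RIS export yourself before importing it, the format is plain text: each line has a two-character tag, two spaces, a hyphen, and a value, and each record ends with an `ER` line. The following is a minimal sketch of a reader, not a full RIS implementation. The function name `parse_ris` is hypothetical, continuation lines are ignored, and the tags used for a given field (for example `TI` versus `T1` for the title) vary between databases, so check your own export.

```python
import re

# Two-character tag, two spaces, hyphen, optional space, then the value
RIS_LINE = re.compile(r"^([A-Z][A-Z0-9])  - ?(.*)$")


def parse_ris(text):
    """Return a list of records; each record maps a tag to a list of values."""
    records, current = [], {}
    for line in text.splitlines():
        match = RIS_LINE.match(line.rstrip())
        if not match:
            continue  # this sketch skips blank and continuation lines
        tag, value = match.groups()
        if tag == "ER":  # end of record
            if current:
                records.append(current)
            current = {}
        else:
            current.setdefault(tag, []).append(value.strip())
    return records


sample = """TY  - JOUR
AU  - Kwon, Y.
AU  - Lemieux, M.
TI  - Identifying and removing duplicate records from systematic review searches
PY  - 2015
DO  - 10.3163/1536-5050.103.4.004
ER  - 
"""

records = parse_ris(sample)
print(len(records), records[0]["AU"])
```

Counting the parsed records is a quick way to confirm that an export contains the number of results the database reported.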

You can also

Removing Duplicates in Zotero

Find & Merge Duplicates

It's best to have only one copy of each reference in your library. You can view duplicates Zotero has automatically identified by selecting the Duplicate Items special collection under My Library.

In Duplicate Items, you'll see multiple copies of each reference. Duplicates may come from different sources, e.g., a database and a publisher's website. If so, they'll have different details. Zotero allows you to merge references, retaining the most useful information in each field. A final merged reference will keep all attachments associated with any of the duplicates. In cases where there are identical PDFs attached to the merged references, Zotero deletes the duplicate PDFs. Annotations created in Zotero are merged in the retained PDF. If you find Zotero has deleted a PDF you wish to retain, you can restore it from the Trash.

To merge references, select one of the items in the center pane. Zotero will automatically also select anything it perceives as a duplicate.

In the details pane, Zotero will ask you to choose a "master item." Choose the one with the most complete and useful details.

For fields where duplicates differ, tap the flowchart icon to the right of the field to choose which reference's details to use for that field only.

Once you've made all desired modifications, tap the Merge items banner at the top of the details pane.

A few important things to note:

When you merge references, the merged version will appear in all collections in which any of the original (pre-merged) citations appear.
If you've used the pre-merged citations in any documents, a refresh of those documents will replace them with the merged version.
Zotero may not automatically find all your duplicates, so consider sorting your library by title occasionally to catch any additional ones.
Zotero may believe references are duplicates that are not, such as a conference paper and an article by the same authors with similar titles. Be cautious when merging duplicates.
There is no way to automatically merge all the duplicates Zotero has identified at one time. You must complete the process for each set of duplicates.

Plugin for Bulk Deduplication in Zotero

This plugin works with the new version of Zotero (version 7.0.7).

Go to https://github.com/ChenglongMa/zoplicate/releases/tag/3.0.8

Click on zoplicate.xpi.

In Zotero, go to "Tools" and then "plugins." Then click on the tool/gear/setting icon in the upper right corner. Then click on "Install plugins from file." Once you install it, you should see "bulk merge all duplicate items" when you click on the duplicates folder.

After you have it installed, go to Edit, then Settings, then Zoplicate. Under master, select "most detailed".

Instructions on how to use zoplicate can be found at https://github.com/ChenglongMa/zoplicate (scroll down the page).

Manual Deduplication

Export your references to a CSV or Excel file. In most cases, you will need to first use conditional formatting in Excel to identify duplicates, then do a final scan manually.

Conditional formatting

Sort the column alphabetically. (Start with titles, though you can use this same process for any other columns you choose, such as DOI.)
Select conditional formatting from the Home ribbon, go to Highlight Cells Rules, then Duplicate Values.
Replace punctuation (dashes, periods, question marks, semicolons, colons) in titles with spaces using the find and replace tool.
For titles, truncating (to 30 characters, for example, though this number is arbitrary) will sometimes find more duplicates.
- insert a blank column
- use this formula =LEFT(C2,30) where C2 is the cell you are truncating
- copy the formula down the length of the column to truncate it all
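The same punctuation cleanup and truncation can be sketched outside Excel. This is an illustration under stated assumptions, not part of the Kwon (2015) process: the function names are hypothetical, and lowercasing and collapsing repeated spaces are additions that go slightly beyond the steps listed above.

```python
import re
from collections import defaultdict


def title_key(title, length=30):
    """Build a comparison key: punctuation to spaces, then truncate."""
    # Dashes, periods, question marks, semicolons, colons become spaces
    cleaned = re.sub(r"[-–—.?;:]", " ", title)
    # Added assumption: collapse runs of spaces and ignore case
    cleaned = re.sub(r"\s+", " ", cleaned).strip().lower()
    return cleaned[:length]  # same idea as =LEFT(C2,30)


def group_possible_duplicates(titles, length=30):
    """Return lists of row indexes whose truncated keys match."""
    groups = defaultdict(list)
    for index, title in enumerate(titles):
        groups[title_key(title, length)].append(index)
    return [indexes for indexes in groups.values() if len(indexes) > 1]


titles = [
    "De-duplication of database search results for systematic reviews in EndNote",
    "De duplication of database search results for systematic reviews in EndNote.",
    "Identifying and removing duplicate records from systematic review searches",
]
print(group_possible_duplicates(titles))
```

Matches found this way are only candidates. Two different articles can share the first 30 characters of a title, so each group still needs the manual check described in the next section.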

Manual scan

Sort by title
Scan through the list, looking for duplicate titles
Check the additional information (author, journal, volume, page number) to make sure it matches before designating a duplicate

DO NOT delete duplicate records. Instead, move them to a separate sheet for duplicates, to track numbers.
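If the records are handled in a script rather than a spreadsheet, the same rule applies: set duplicates aside instead of deleting them. The following is a minimal sketch with hypothetical field names and made-up records. It only catches exact matches on every listed field (ignoring case and surrounding spaces), so it does not replace the manual scan.

```python
def split_duplicates(records,
                     fields=("title", "author", "journal", "volume", "pages")):
    """Separate records into unique ones and exact duplicates, deleting nothing."""
    seen = set()
    unique, duplicates = [], []
    for record in records:
        # A record counts as a duplicate only if every field matches
        key = tuple(str(record.get(field, "")).strip().lower() for field in fields)
        if key in seen:
            duplicates.append(record)
        else:
            seen.add(key)
            unique.append(record)
    return unique, duplicates


# Made-up records, for illustration only
records = [
    {"title": "Sleep quality in nurses", "author": "Lee, A.",
     "journal": "Hypothetical Journal", "volume": "12", "pages": "1-9"},
    {"title": "sleep quality in nurses ", "author": "Lee, A.",
     "journal": "Hypothetical Journal", "volume": "12", "pages": "1-9"},
    {"title": "Sleep quality in nurses", "author": "Lee, A.",
     "journal": "Hypothetical Conference Proceedings", "volume": "3", "pages": "44"},
]

unique, duplicates = split_duplicates(records)
print(len(unique), len(duplicates))
```

The third record has the same title and author but a different journal, so it is kept. The length of the duplicates list is the number to carry into the PRISMA diagram.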

This process was adapted from Kwon (2015).

Removing Duplicates with EndNote

Most bibliographic management software includes a deduplication option. You might consider uploading your references to EndNote, for example, removing the duplicates and then going through the remainder of your list manually. Qi et al. (2013) found that one method of automatic deduplication was inadequate. See the following YouTube video on integrating PRISMA with EndNote.

APU no longer supports EndNote. You can purchase a personal copy from the EndNote website.

Before deduplicating, you will need a merged EndNote library containing the records from all your separate EndNote libraries for the individual database searches if you had previously exported records from each database into separate libraries:

Create a new EndNote library that will contain the records from all the databases you searched (I like to put DEDUPING in the EndNote library name)
Import the records from each EndNote library you created for the individual database searches:
- Go to EndNote menu > File > Import > File
- Next to "Import File", browse to choose the .enl file (NOT the .enlx file) for each library of downloaded records from your searches and select "EndNote Library" as the "Import Option"
- Once all the records have been added to this new library, check to make sure the final number of records, before removing duplicates, matches the sum of the records found for all the database searches.
Using this merged library of records from your individual database searches, you are now ready to remove duplicates. Here are three methods you can use:

After you have merged your libraries, with the Library window open, click on the All References group to show all references.
From the Library menu, choose Find Duplicates.
EndNote will display a Find Duplicates dialog, where you can compare duplicates and decide which version to keep and which to delete. Be sure you keep track of the number of duplicates you are deleting.
For each set of duplicates, you have the option to:
- Click Keep This Record to save that particular reference and throw the other one in the Trash.
- Click Skip to leave both references in the library, intact. They will appear in a temporary Duplicate References group, so you can review them later.
- Click Cancel to automatically instruct EndNote to select the most recently entered version(s) of each duplicate reference as the one(s) to be removed. If you move the selected references to the trash, they will be removed from the library, removing all duplicate copies EndNote found. All duplicates (including the original copy of the reference, which will not be selected when the group window is created) will appear in a temporary Duplicate References group if you wish to review them. Do NOT simply move all references in the Duplicate References group to the trash unless you want to remove all copies of all references duplicated, including the original.

An earlier version of the "Bramer method" for deduplicating, with steps provided in Word document format:

A paper describing more advanced configuration options for removing duplicates in EndNote: Bramer WM, Giustini D, de Jonge GB, Holland L, Bekhuis T. De-duplication of database search results for systematic reviews in EndNote. Journal of the Medical Library Association: JMLA. 2016;104(3):240-243. doi:10.3163/1536-5050.104.3.014

After deduplication: create a compressed library for backup after having removed as many duplicates as possible, with a filename like SearchTerms-yyyymmdd-Deduplicated-xRecords.enlx. This will be the library for screening.

EndNote has created a video demonstrating deduplication.
Go to their tutorial to view it - EndNote.

Removing Duplicates with Automated Systematic Search Deduplicator (ASySD)

Removing duplicate references obtained from different databases is an essential step when conducting and updating systematic literature reviews. ASySD is a tool to automatically identify and remove duplicate records. Hair et al. (2021) compared ASySD deduplication to SRA-DM and EndNote. They found that "ASySD identified more duplicates than either SRA-DM or Endnote, with a sensitivity in different datasets of 0.95 to 0.99. The false-positive rate was comparable to human performance, with a specificity of 0.94-0.99. The tool took less than 1 hour to deduplicate all datasets" (Hair et al., 2021).
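For readers less familiar with these measures: in deduplication, sensitivity is the share of true duplicates that a tool flags, and specificity is the share of non-duplicates that it correctly leaves alone. The sketch below uses the standard definitions with hypothetical counts. The numbers are not taken from Hair et al. (2021).

```python
def sensitivity(true_positives, false_negatives):
    """Share of true duplicates that were flagged as duplicates."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Share of unique records that were correctly left in the library."""
    return true_negatives / (true_negatives + false_positives)


# Hypothetical counts, for illustration only:
# 100 true duplicates, of which 95 were flagged and 5 were missed;
# 1,000 unique records, of which 10 were wrongly flagged as duplicates.
print(sensitivity(true_positives=95, false_negatives=5))
print(specificity(true_negatives=990, false_positives=10))
```

In a systematic review, a false positive (a unique record wrongly removed) is usually the more costly error, which is why a manual check of flagged pairs is still worthwhile.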

The tool is written in R and has been created as a Shiny web app available online. For very large datasets (>50,000 records) it is advisable to download the code and run locally as a Shiny app within RStudio.

Users can deduplicate records by following these steps:

Go to https://camarades.shinyapps.io/RDedup/
Upload your reference library as an .XML file directly from EndNote, a .csv file, or a tab-delimited .txt file.
Specify any labelled records to preferentially keep in the library, e.g. keep references obtained in a previous search labelled as "old" over the same records found in a new search.
Navigate to the Deduplicate data tab and click a single button in the Automated Deduplication section to remove duplicates automatically. Depending on the size of the dataset, this can take several minutes. ASySD will highlight how many records have been removed.
Remove any additional duplicates manually under the Manual Deduplication section. Select the IDs you want to remove from the side-by-side table of matching duplicates and click to remove them from your reference library.
Download the unique reference library. There are several options here, for example downloading your library after automated deduplication only, or downloading the reference pairs ASySD detected.

Removing Duplicates with the SRA Deduplicator (SRA-DM)

Removing duplicate records with the IEBH SR-Accelerator Deduplicator

Use the online Deduplicator tool. This is a new version as of August 2021.
Help using Deduplicator

Large sets of records

The offline DeDuplicator (old version) may be useful for large sets of records (which used to be considered ≥ 2000). Download the SRA-dedupe-UI application from GitHub. As of October 2020, there were only Linux and Windows versions available.
Help using DeDuplicator Offline
When exporting your records to an XML file, don't forget to select them all (Help Importing/Exporting EndNote records).
If you choose to use the standalone executable version for Windows, you may get a message that "Windows protected your PC". Click on More info and a Run anyway button will appear, which you should click if you feel comfortable trusting the software developers.
Keep a copy of the RIS or XML file for your records.

Articles about Deduplication

Bramer, W. M., Giustini, D., de Jonge, G. B., Holland, L., & Bekhuis, T. (2016). De-duplication of database search results for systematic reviews in EndNote. Journal of the Medical Library Association : JMLA, 104(3), 240–243. https://doi.org/10.3163/1536-5050.104.3.014

Kwon, Y., Lemieux, M., McTavish, J., & Wathen, N. (2015). Identifying and removing duplicate records from systematic review searches. Journal of the Medical Library Association : JMLA, 103(4), 184–188. https://doi.org/10.3163/1536-5050.103.4.004

Qi, X., Yang, M., Ren, W., & Jia, J. (2013). Find duplicates among the PubMed, EMBASE, and Cochrane Library databases in systematic review. PLoS ONE, 8(8), e71838. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0071838

Rathbone, J., Carter, M., Hoffmann, T., & Glasziou, P. (2015). Better duplicate detection for systematic reviewers: evaluation of Systematic Review Assistant-Deduplication Module. Systematic Reviews, 4(6), 1-6. https://doi.org/10.1186/2046-4053-4-6

Hair, K., Bahor, Z., Macleod, M.R., Liao, J., & Sena, E.S. (2021). The Automated Systematic Search Deduplicator (ASySD): a rapid, open-source, interoperable tool to remove duplicate citations in biomedical systematic reviews. bioRxiv. https://doi.org/10.1101/2021.05.04.442412
