[Resolved] Finding duplicate phrases
Problem: I've been working on some really long docs written by a bunch of different people, and I keep running across very similar text over and over again. I wanted a more automatic way to track this down and found a few tricks. Years back someone posted about a program that analyzes text files for word frequency. It turns out that NoteTab will also do this (from the menu, select Tools - Text Statistics - Word Frequency), but what I really need is to find phrases. For example, if "laser focused on outcomes" comes up 5x in a short document, you know you need an edit.
Does anyone know of a program for this?
A few programs came close
* Text Deduplicator Plus - Looks portable but only checks lines rather than phrases
* A LibreOffice/OpenOffice trick with a similar limitation.
The various Word VB scripts out there (like this one) sadly aren't working for me.
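For anyone who would rather script this than hunt for a program, here is a minimal sketch of phrase counting, assuming plain-text input. It slides a window of n words across the text and counts each window. The function name `repeated_phrases` and its parameters are made up for illustration and are not taken from any tool mentioned in this thread.

```python
import re
from collections import Counter

def repeated_phrases(text, n=4, min_count=2):
    """Return (phrase, count) pairs for every n-word phrase seen at least min_count times."""
    # Lowercase and keep only word characters so case and punctuation do not split matches.
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Slide a window of n words across the text and count each window.
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    counts = Counter(grams)
    return [(p, c) for p, c in counts.most_common() if c >= min_count]

sample = (
    "We stay laser focused on outcomes. The team is laser focused on outcomes, "
    "and leadership remains laser focused on outcomes."
)
for phrase, count in repeated_phrases(sample, n=4):
    print(count, phrase)
```

Running it with several values of n (say 3 through 8) is one way to catch both short and long repeats. Longer windows give fewer but more telling hits.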
Re: Finding duplicate phrases
From where I stand, what you're looking for is a specialized kind of software generally called a concordancer (https://en.wikipedia.org/wiki/Concordancer). A decade ago I would have had some ready suggestions for you, but too much time has passed since.
Nevertheless, I hazily recall that you could feed Word for Windows a text list of expressions (one per line) and it would automark every occurrence in a given document in order for an index to be generated...
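As a rough illustration of what a concordancer does with a list of expressions (one per line), here is a minimal keyword-in-context sketch. It is not how Word or any particular concordancer works internally. The function name `kwic`, the sample text, and the phrase list are all hypothetical.

```python
import re

def kwic(text, phrases, width=30):
    """Yield (phrase, left context, match, right context) for each occurrence."""
    for phrase in phrases:
        # re.escape keeps punctuation in the phrase literal; matching ignores case.
        for m in re.finditer(re.escape(phrase), text, flags=re.IGNORECASE):
            left = text[max(0, m.start() - width):m.start()]
            right = text[m.end():m.end() + width]
            yield phrase, left, m.group(0), right

text = "The licensee may copy the Software. The Licensee may not resell the Software."
phrase_list = "licensee may\nthe software"  # one expression per line
hits = list(kwic(text, phrase_list.splitlines(), width=10))
for phrase, left, match, right in hits:
    print(f"{left:>10}[{match}]{right}")
```

Each hit is printed with a little text on either side, which is the classic concordance view.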
- __philippe
Re: Finding duplicate phrases
Selection of free concordance tools offered by Yatsko's Computational Linguistics Laboratory :
Intro :
http://yatsko.zohosites.com/about-us.html
Tools :
http://yatsko.zohosites.com/products.html
Re: Finding duplicate phrases
Thanks Midas and __philippe for these very useful suggestions.
---
This topic took me some time to get back to, as I had a bunch of document research and then none for over a year. Anyway, adTAT is a great intro tool and is portable.
Steps:
1. Download and extract the contents of the installation using 7zip
2. Launch adTAT.exe
Portable: yes, saves no settings. Stealth: untested
Uses: text files only, but can open any number of text files for expansive research.
Requires: Java
You can already see it picking out some patterns from a license document on my machine:
Re: Finding duplicate phrases
This may not fit under the umbrella of duplicate phrasing, but it's a related text analysis program that's super simple and with broad appeal when we're all drowning in information.
HR Automation Tool (HRAT)
Background: this is one of the first uses of document analysis I ever heard about, where companies dig through a host of documents looking for a few clear keywords. As the program description points out, it can be used for things other than resume analysis, such as:
- Comparing white papers for relevance
- Checking if proposals actually discuss specific service requirements.
- Looking for specific and important terms
- Works on entire folders and their subfolders
- Accepts a list of words and then provides results of this collection of terms.
- Somewhat buggy. Had some difficulty selecting the word list option ("Unlimited Skill Set Analysis Mode").
- This was developed roughly 6 years ago and may or may not function with recent Microsoft Office / PDF version requirements.
- You're going to want to export the results to Excel. The internal analysis tools are fairly limited.
- EDIT: Doesn't support DOCX files (development stopped I believe a little before this format was ready). There are a number of easy batch tools in LibreOffice to convert from DOCX to either earlier DOC or TXT formats so that's annoying but not impossible to resolve.
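The core idea (take a word list, walk a folder and its subfolders, report counts per file) can be sketched in a few lines if you only have plain-text files. This is an assumption-laden illustration, not HRAT's actual logic. The function names, the `*.txt` pattern, and the CSV layout are all my own choices.

```python
import csv
import re
import sys
from pathlib import Path

def count_terms(root, terms, pattern="*.txt"):
    """Return one row per file: [relative path, count of term 1, count of term 2, ...]."""
    rows = []
    for path in sorted(Path(root).rglob(pattern)):  # rglob also descends into subfolders
        text = path.read_text(encoding="utf-8", errors="ignore").lower()
        # Whole-word, case-insensitive count of each term in this file.
        counts = [len(re.findall(r"\b" + re.escape(t.lower()) + r"\b", text)) for t in terms]
        rows.append([path.relative_to(root).as_posix(), *counts])
    return rows

def write_csv(rows, terms, out=sys.stdout):
    """Write the rows as CSV so the results can be opened in a spreadsheet."""
    writer = csv.writer(out)
    writer.writerow(["file", *terms])
    writer.writerows(rows)

# Example call (folder name and terms are hypothetical):
# terms = ["service level", "uptime"]
# write_csv(count_terms("proposals", terms), terms)
```

Writing to CSV mirrors the advice above about exporting to a spreadsheet for the real analysis.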
Screenshot
Example output (spreadsheet)
Website
https://sourceforge.net/projects/hrat/
http://www.softpedia.com/get/Office-too ... yzer.shtml (no idea why it has a different name on Softpedia)
Portability
Portable, requires Java
Steps:
1. Download and install to the default location
2. Modify the HRAutomationTool.ini file to change the following lines:
For the line that starts with "Class Path", replace with: Class Path=.\CVA_Data\CVA.jar;
For the line that starts with "Splash Screen", replace with: Splash Screen=.\CVA_Data\splash_Full.jpg
3. Move to a folder of your choice and launch HRAutomationTool.exe
Re: Finding duplicate phrases [Resolved]
Resolved
In digging around for some of the posts, I solved the duplicate phrases question. Sadly, it was a program I overlooked many years ago:
MatnPardaz - Free Word (and phrase) Frequency Counter
Note that I updated the thread topic to include "and phrase". If that had been included initially, I might have found that program sooner.
I still have a lot of work ahead of me looking at the various recommended Concordancer tools, which are a little more intelligent than the MatnPardaz program.
Re: [Resolved] Finding duplicate phrases
FTR, dtSearch (https://www.dtsearch.com/) was the best commercial product I have ever tested for all around phrase searching -- when it was still shareware... I believe that if you're a developer, you can request a fully functional evaluation copy at their website.
Re: [Resolved] Finding duplicate phrases
Midas wrote (Sun Jul 08, 2018 1:41 pm): FTR, dtSearch (https://www.dtsearch.com/) was the best commercial product I have ever tested for all around phrase searching -- when it was still shareware... I believe that if you're a developer, you can request a fully functional evaluation copy at their website.
Oh wow ... just the hit highlighting might make my month.
Thanks for that. Again.
Re: [Resolved] Finding duplicate phrases
Adding this to the concordance thread, just because I don't suspect there's a wide audience ...
---
WordStatix is a basic document / word analysis tool with some nice extras.
Steps: Download installation file, run Uniextract2
Resources: RAM: 4.7 M Disk: 4.6 M (less if you delete the PDF manual)
Status: On the fence. It writes mostly incidental settings to appdata. Ignoring those values it might be acceptable.
License: GPL
Sites:
https://www.softpedia.com/get/Office-to ... atix.shtml
https://sites.google.com/site/wordstatix/
https://sites.google.com/site/wordstatix/files
Related links:
- Dev posted about it initially here: https://forum.lazarus.freepascal.org/in ... ic=33798.0
- A good program summary: https://listoffreeware.com/best-free-co ... e-windows/
Re: Finding duplicate phrases [Resolved]
For those interested, here is a program that has many of MatnPardaz's features and more. Unfortunately it's from a shareware company that has disappeared:
https://web.archive.org/web/20100916094 ... extanz.jsp
https://en.lo4d.com/s/Textanz
EDIT: I haven't tested the program yet, so it's unclear what happens after the trial period.
Re: [Resolved] Finding duplicate phrases
Even though this thread has already been resolved, I'm still interested in tools that might go further.
NOTE: This is Shareware, not freeware, but I'm a big fan of the developer and he does list a portable version.
---
I went looking to see if this program could identify duplicate text and it does (and works great) but only finds duplicate lines, meaning if even one character is different between two lines, it won't notice.
There are options to either skip a few characters or truncate a line (e.g. if you used both you could identify every sentence where the second word matches), as well as a check using regular expressions. But none of these addresses the problem that started this thread, which was finding duplicate content hiding in plain sight.
Dupli Find
https://rlvision.com/dupli/about.php
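To make the limitation concrete, here is a minimal sketch of line-based duplicate detection with a skip and a truncate option. It only mimics the idea described above and is not Dupli Find's implementation. The function name and parameters are invented for the example.

```python
from collections import defaultdict

def duplicate_lines(lines, skip=0, length=None):
    """Group 1-based line numbers whose compared slice is identical."""
    groups = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        # Compare only the slice that starts after `skip` characters
        # and, if `length` is given, runs for `length` characters.
        end = None if length is None else skip + length
        key = line.strip()[skip:end]
        if key:  # ignore lines whose compared slice is empty
            groups[key].append(number)
    return {key: nums for key, nums in groups.items() if len(nums) > 1}

lines = [
    "1. Install the tool",
    "2. Install the tool",
    "3. Install the tools",
]
print(duplicate_lines(lines))                     # exact comparison
print(duplicate_lines(lines, skip=3))             # ignore the numbering
print(duplicate_lines(lines, skip=3, length=11))  # compare only a short slice
```

The exact comparison finds nothing because every line differs by at least one character. Skipping and truncating help only when you already know where the variation sits, which is why this approach misses repeated phrases buried mid-sentence.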
Re: [Resolved] Finding duplicate phrases
So I found some freeware that will look for duplicate words or phrases across many files: wordTabulator.
Here's an example output from some of Nirs0ft's readme files and the different words and phrases that re-appear across the files. The links show which files have the phrase and how many times it repeats individually.
I just tested it with TXT files, but should also work with HTML.
What would I use this for?
This could help you grab key terms and phrases, create boilerplate documents, and spot over-referencing (plagiarism) and word repetition/overuse. Really cool little program.
Recommendation:
The Interface takes some getting used to. For your first run:
- Set the file format to .TXT or .HTM or whatever (should be text only)
- Add a few files in this defined format
- Click the Run button
- View the results
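The kind of cross-file report described above (which phrases recur, in which files, and how often in each) can be approximated with a short script. This is a sketch under my own assumptions, not wordTabulator's algorithm. The function name, the file names, and the sample contents are hypothetical.

```python
import re
from collections import Counter, defaultdict

def shared_phrases(docs, n=3, min_files=2):
    """Map each n-word phrase to {file name: count}, keeping phrases found in at least min_files files."""
    table = defaultdict(Counter)
    for name, text in docs.items():
        words = re.findall(r"[a-z0-9']+", text.lower())
        for i in range(len(words) - n + 1):
            table[" ".join(words[i:i + n])][name] += 1
    return {p: dict(c) for p, c in table.items() if len(c) >= min_files}

docs = {  # hypothetical file names and contents
    "a_readme.txt": "This utility is released as freeware. This utility is small.",
    "b_readme.txt": "License: this utility is released as freeware.",
}
result = shared_phrases(docs, n=3)
for phrase, per_file in result.items():
    print(phrase, per_file)
```

In practice you would fill `docs` by reading each text file from disk. The dictionary keeps the example self-contained.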
---
License: GPLv2
Web pages:
http://wordtabulator.sourceforge.net/
https://sourceforge.net/projects/wordtabulator/
https://www.softpedia.com/get/Office-to ... ator.shtml
Status: Not portable, couldn't seem to get it to run outside of the Program Files directory
Does not appear to be in active development.
Re: [Resolved] Finding duplicate phrases
Found a program that seems expressly focused on finding duplicate content across two groups of content, à la plagiarism checking: Duplicate Text Finder.
You could add a lot of literature in one folder (the authority) and see if anything borrowed shows up in the second folder (the edits). It compares groups of at least 20 words at a time, so it won't find short quotes like "the only thing we have to fear is fear itself".
Appears to support plain text only, so you can't have it check DOC or PDF files. No longer in development or even mentioned on the author's website.
Homepage: https://web.archive.org/web/20161222201 ... tf/dtf.htm
Softpedia: https://www.softpedia.com/get/File-mana ... nder.shtml
Screenshot
Status: Portable, writes no settings
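Here is a minimal sketch of the two-group comparison, assuming the check works on overlapping word windows ("shingles") of a fixed size. That is my guess at the general technique, not the program's documented method. The function names and sample texts are made up.

```python
import re

def shingles(text, size=20):
    """Return the set of all overlapping word windows of the given size."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}

def borrowed(authority_texts, edit_text, size=20):
    """Return windows from edit_text that also occur in any authority text."""
    known = set()
    for t in authority_texts:
        known |= shingles(t, size)
    return shingles(edit_text, size) & known

authority = [
    "So, first of all, let me assert my firm belief that the only thing "
    "we have to fear is fear itself."
]
edit = "As the saying goes, the only thing we have to fear is fear itself, more or less."

print(borrowed(authority, edit, size=20))  # window too large for a short quote
print(borrowed(authority, edit, size=10))  # smaller window catches it
```

With a 20-word window the ten-word quote slips through, which matches the limitation noted above. Shrinking the window finds it at the cost of more incidental matches on larger collections.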