[Resolved] Finding duplicate phrases

webfork
Posts: 10818
Joined: Wed Apr 11, 2007 8:06 pm
Location: US, Texas
Contact:

[Resolved] Finding duplicate phrases

#1 Post by webfork »

Problem: I've been working on some really long docs written by a bunch of different people, and I keep running across very similar text over and over again. I wanted a more automatic way to track this down and found a few tricks. Years back someone posted about a program that analyzes text files for word frequency. It turns out that NoteTab will also do this (from the menu, select Tools - Text Statistics - Word Frequency), but what I really need is to find repeated phrases. For example, if "laser focused on outcomes" comes up 5x in a short document, you know you need an edit.
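
To make the goal concrete, here is a minimal Python sketch of phrase (n-gram) counting. It is not how NoteTab or any program in this thread works; the function name `repeated_phrases`, the phrase length `n` and the `min_count` cutoff are all made up for illustration.

```python
import re
from collections import Counter


def repeated_phrases(text, n=4, min_count=2):
    """Return (phrase, count) pairs for n-word phrases seen at least min_count times."""
    # Lowercase and keep only word characters so punctuation does not split matches.
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Slide a window of n words across the text and count each window.
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    counts = Counter(grams)
    return [(p, c) for p, c in counts.most_common() if c >= min_count]


sample = (
    "We are laser focused on outcomes. The team stays laser focused on outcomes, "
    "and the plan is laser focused on outcomes too."
)
for phrase, count in repeated_phrases(sample, n=4, min_count=3):
    print(count, phrase)
```

A fixed phrase length is the main simplification here: a real tool would probably try several lengths and merge overlapping results.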

Does anyone know of a program for this?

A few programs came close:

* Text Deduplicator Plus - Looks portable but only checks lines rather than phrases
* A LibreOffice/OpenOffice trick with a similar limitation.

The various Word VB scripts out there (like this one) sadly aren't working for me.

Midas
Posts: 6705
Joined: Mon Dec 07, 2009 7:09 am
Location: Sol3

Re: Finding duplicate phrases

#2 Post by Midas »

From where I stand, what you're looking for is a specialized kind of software generally called a concordancer (https://en.wikipedia.org/wiki/Concordancer). A decade back I would have had some ready suggestions for you, but too much time has passed since.
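
For anyone unfamiliar with the term, the core of a concordancer is a keyword-in-context (KWIC) listing: every occurrence of a term shown with the text around it. A rough Python sketch, assuming a plain string as input; the function name `kwic` and the `width` parameter are hypothetical and not taken from any particular tool.

```python
import re


def kwic(text, keyword, width=30):
    """Return one 'left [match] right' line per occurrence of keyword."""
    # Collapse line breaks and runs of spaces so each context fits on one line.
    text = " ".join(text.split())
    # \b keeps "art" from matching inside "start"; IGNORECASE catches capitalized uses.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the matches line up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines


doc = "The licensee may copy the software.\nThe Licensee may not resell it."
for line in kwic(doc, "licensee"):
    print(line)
```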

Nevertheless, I hazily recall that you could feed Word for Windows a text list of expressions (one per line) and it would automark every occurrence in a given document in order for an index to be generated...

__philippe
Posts: 687
Joined: Wed Jun 26, 2013 2:09 am

Re: Finding duplicate phrases

#3 Post by __philippe »

Selection of free concordance tools offered by Yatsko's Computational Linguistics Laboratory:

Intro:
http://yatsko.zohosites.com/about-us.html

Tools:
http://yatsko.zohosites.com/products.html

Re: Finding duplicate phrases

#4 Post by webfork »

Thanks Midas and __philippe for these very useful suggestions.

---

This topic took me some time to get back to, as I had a bunch of document research and then none for over a year. Anyway, adTAT is a great intro tool and is portable.

Steps:

1. Download the installer and extract its contents using 7-Zip
2. Launch adTAT.exe

Portable: yes, saves no settings. Stealth: untested

Uses: text files only, but can open any number of text files for expansive research.

Requires: Java

You can already see it picking out some patterns from a license document on my machine:

Image

Re: Finding duplicate phrases

#5 Post by webfork »

This may not fit under the umbrella of duplicate phrasing, but it's a related text analysis program that's super simple and with broad appeal when we're all drowning in information.

HR Automation Tool (HRAT)

Background: resume screening was one of the first uses of document analysis I ever heard about, where companies would dig through a host of documents looking for a few clear keywords. As the program description points out, this can be used for things other than resume analysis, such as:
  • Comparing white papers for relevance
  • Checking whether proposals actually discuss specific service requirements
  • Looking for specific and important terms
Features
  • Works on entire folders and their subfolders
  • Accepts a list of words and then reports results for that collection of terms
Limitations
  • Somewhat buggy. Had some difficulty selecting the word list option ("Unlimited Skill Set Analysis Mode").
  • This was developed roughly 6 years ago and may or may not function with recent Microsoft Office / PDF version requirements.
  • You're going to want to export the results to Excel. The internal analysis tools are fairly limited.
  • EDIT: Doesn't support DOCX files (development stopped I believe a little before this format was ready). There are a number of easy batch tools in LibreOffice to convert from DOCX to either earlier DOC or TXT formats so that's annoying but not impossible to resolve.
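
The basic idea behind the features above (a word list checked against a folder and its subfolders) can be sketched in a few lines of Python. This is not HRAT's code: the function name `keyword_hits` is made up, it only handles single-word terms, and it assumes the files are already plain text.

```python
import re
from collections import Counter
from pathlib import Path


def keyword_hits(folder, keywords, pattern="*.txt"):
    """Map each matching file under folder (recursively) to a Counter of keyword hits."""
    wanted = {k.lower() for k in keywords}
    results = {}
    # rglob walks the folder and all of its subfolders.
    for path in sorted(Path(folder).rglob(pattern)):
        text = path.read_text(encoding="utf-8", errors="replace").lower()
        words = re.findall(r"[a-z0-9']+", text)
        results[str(path)] = Counter(w for w in words if w in wanted)
    return results
```

Calling something like `keyword_hits("proposals", ["uptime", "encryption"])` (folder and terms are placeholders) returns one counter per file, which could then be written out as CSV for a spreadsheet.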

Screenshot

Image

Example output (spreadsheet)

Image

Website

https://sourceforge.net/projects/hrat/
http://www.softpedia.com/get/Office-too ... yzer.shtml (no idea why it has a different name on Softpedia)

Portability

Portable, requires Java

Steps:

1. Download and install to the default location
2. Modify the HRAutomationTool.ini file to change the following lines:

For the line that starts with "Class Path", replace with: Class Path=.\CVA_Data\CVA.jar;
For the line that starts with "Splash Screen", replace with: Splash Screen=.\CVA_Data\splash_Full.jpg

3. Move to a folder of your choice and launch HRAutomationTool.exe
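
If you would rather script step 2 than edit the file by hand, something like the following Python sketch should do it. The helper name `patch_ini_lines` is made up, and I'm assuming each setting sits on its own line starting with the key exactly as quoted above; check your copy of the file first.

```python
def patch_ini_lines(lines, replacements):
    """Replace any line starting with a key in replacements by the given full line."""
    patched = []
    for line in lines:
        for prefix, new_line in replacements.items():
            if line.strip().startswith(prefix):
                line = new_line
                break
        patched.append(line)
    return patched


# The two replacement lines from step 2 (raw strings keep the backslashes literal).
REPLACEMENTS = {
    "Class Path": r"Class Path=.\CVA_Data\CVA.jar;",
    "Splash Screen": r"Splash Screen=.\CVA_Data\splash_Full.jpg",
}
```

To apply it, read the file with `Path("HRAutomationTool.ini").read_text().splitlines()`, pass the result through `patch_ini_lines(..., REPLACEMENTS)`, and write the joined lines back. Keep a backup of the original, since the file's encoding and exact layout are not something I verified.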

Re: Finding duplicate phrases [Resolved]

#6 Post by webfork »

Resolved

In digging around for some of the posts, I solved the duplicate phrases question. Sadly, it was a program I overlooked many years ago:

MatnPardaz - Free Word (and phrase) Frequency Counter

Note that I updated the thread topic to include "and phrase". If that was included initially, I might have found that program sooner.

I still have a lot of work ahead of me looking at the various recommended Concordancer tools, which are a little more intelligent than the MatnPardaz program.

Re: [Resolved] Finding duplicate phrases

#7 Post by Midas »

FTR, dtSearch (https://www.dtsearch.com/) was the best commercial product I have ever tested for all around phrase searching -- when it was still shareware... I believe that if you're a developer, you can request a fully functional evaluation copy at their website.

Re: [Resolved] Finding duplicate phrases

#8 Post by webfork »

Midas wrote: Sun Jul 08, 2018 1:41 pm FTR, dtSearch (https://www.dtsearch.com/) was the best commercial product I have ever tested for all around phrase searching -- when it was still shareware... I believe that if you're a developer, you can request a fully functional evaluation copy at their website.
Oh wow ... just the hit highlighting might make my month.

Thanks for that. Again :)

Re: [Resolved] Finding duplicate phrases

#9 Post by webfork »

Adding this to the concordance thread, just because I don't suspect there's a wide audience ...

---

WordStatix is a basic document / word analysis tool with some nice extras.

Steps: Download the installation file and extract it with Uniextract2

Resources: RAM: 4.7 M Disk: 4.6 M (less if you delete the PDF manual)

Status: On the fence. It writes mostly incidental settings to appdata. Ignoring those values it might be acceptable.

License: GPL

Sites:
Related links:

Re: Finding duplicate phrases [Resolved]

#10 Post by webfork »

For those interested, here's a program that has many of MatnPardaz's features and more. Unfortunately it's from a shareware company that has disappeared:

https://web.archive.org/web/20100916094 ... extanz.jsp
https://en.lo4d.com/s/Textanz

EDIT: I haven't tested the program yet, so it's unclear what happens after the trial period.

Re: [Resolved] Finding duplicate phrases

#11 Post by webfork »

Even though this thread has already been resolved, I'm still interested in tools that might go further.

NOTE: This is Shareware, not freeware, but I'm a big fan of the developer and he does list a portable version.

---

I went looking to see if this program could identify duplicate text and it does (and works great) but only finds duplicate lines, meaning if even one character is different between two lines, it won't notice.

Image

There are options to either skip a few characters or truncate a line (e.g. if you used both you could identify every sentence where the second word is the same), as well as a check using regular expressions. But none of these addresses the problem that started this thread, which was finding duplicate content hiding in plain sight.
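
To show what "one character different" means in practice, here is a small Python sketch of fuzzy line matching using the standard library's `difflib`. To be clear, this is a different technique from what Dupli Find does; the function name `near_duplicate_lines` and the 0.9 threshold are arbitrary choices for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicate_lines(lines, threshold=0.9):
    """Return (i, j, ratio) for line pairs whose similarity ratio is at least threshold."""
    pairs = []
    # Compare every pair of non-empty lines; this is quadratic, so fine for short files only.
    indexed = [(i, line.strip()) for i, line in enumerate(lines) if line.strip()]
    for (i, a), (j, b) in combinations(indexed, 2):
        ratio = SequenceMatcher(None, a, b).ratio()
        if ratio >= threshold:
            pairs.append((i, j, round(ratio, 3)))
    return pairs


lines = [
    "The quick brown fox jumps.",
    "The quick brown fox jumped.",
    "Something else entirely.",
]
# An exact-match check finds nothing here, but the first two lines are nearly identical.
print(len(set(lines)) == len(lines))
print(near_duplicate_lines(lines))
```

This still works line by line, so it would not catch a repeated phrase buried inside two otherwise different paragraphs.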

Dupli Find
https://rlvision.com/dupli/about.php

Re: [Resolved] Finding duplicate phrases

#12 Post by webfork »

So I found some freeware that will look for duplicate words or phrases across many files: wordTabulator.

Image

Here's an example output from some of Nirs0ft's readme files and the different words and phrases that re-appear across the files. The links show which files have the phrase and how many times it repeats individually.

Image

I only tested it with TXT files, but it should also work with HTML.
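
For the curious, the kind of report described above (which files share a phrase, and how often each one repeats it) can be approximated with a short Python sketch. This is my own illustration, not wordTabulator's method; `phrases_across_files`, the phrase length `n` and the `min_files` cutoff are invented names and defaults.

```python
import re
from collections import Counter, defaultdict


def phrases_across_files(texts, n=3, min_files=2):
    """texts maps a file name to its text. Returns {phrase: {file name: count}}
    for n-word phrases that occur in at least min_files files."""
    per_phrase = defaultdict(dict)
    for name, text in texts.items():
        words = re.findall(r"[a-z0-9']+", text.lower())
        # Count every run of n consecutive words within this one file.
        grams = Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
        for phrase, count in grams.items():
            per_phrase[phrase][name] = count
    # Keep only phrases shared by enough different files.
    return {p: files for p, files in per_phrase.items() if len(files) >= min_files}


texts = {
    "a.txt": "This utility is released as freeware. This utility is small.",
    "b.txt": "This utility is released as freeware too.",
    "c.txt": "Nothing in common here.",
}
for phrase, files in phrases_across_files(texts).items():
    print(phrase, files)
```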

What would I use this for?

This could help you grab key terms and phrases, create boilerplate documents, and spot over-referencing (plagiarism) and word repetition/overuse. Really cool little program.

Recommendation:

The interface takes some getting used to. For your first run:
  1. Set the file format to .TXT or .HTM or whatever (should be text only)
  2. Add a few files in this defined format
  3. Click the Run button
  4. View the results
Image

---

License: GPLv2

Web pages:

http://wordtabulator.sourceforge.net/
https://sourceforge.net/projects/wordtabulator/
https://www.softpedia.com/get/Office-to ... ator.shtml

Status: Not portable; I couldn't get it to run outside of the Program Files directory

Does not appear to be in active development.

Re: [Resolved] Finding duplicate phrases

#13 Post by webfork »

Found a program that seems expressly focused on finding duplicate content across two groups of content, à la plagiarism checking: Duplicate Text Finder.

You could add a lot of literature in one folder (the authority) and see if anything borrowed shows up in the second folder (the edits). It checks groups of at least 20 words at a time, so this won't find short quotes like "the only thing we have to fear is fear itself".

Appears to support plain text only, so you can't have it check DOC or PDF files. No longer in development or even mentioned on the author's website.
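
A rough Python sketch of the two-group idea, using fixed-length word runs ("shingles"). I don't know how Duplicate Text Finder is actually implemented, so treat this as one plausible approach; `shingles`, `borrowed_passages` and the `size` parameter are names I made up.

```python
import re


def shingles(text, size=20):
    """Return the set of every run of `size` consecutive words in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def borrowed_passages(authority_texts, edit_texts, size=20):
    """Return {edit name: shared shingles} for edits that reuse authority wording."""
    known = set()
    for text in authority_texts.values():
        known |= shingles(text, size)
    hits = {}
    for name, text in edit_texts.items():
        shared = shingles(text, size) & known
        if shared:
            hits[name] = shared
    return hits


authority = {
    "speech.txt": "So first of all let me assert my firm belief that "
                  "the only thing we have to fear is fear itself",
}
edits = {
    "draft.txt": "As he said, the only thing we have to fear is fear itself, and I agree.",
    "clean.txt": "An entirely original sentence with no overlap at all.",
}
# A 20-word window misses the short quote; a 5-word window catches it.
print(sorted(borrowed_passages(authority, edits, size=20)))
print(sorted(borrowed_passages(authority, edits, size=5)))
```

The window size is the trade-off: smaller windows catch short quotes but also flag common everyday phrases.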

Homepage: https://web.archive.org/web/20161222201 ... tf/dtf.htm
Softpedia: https://www.softpedia.com/get/File-mana ... nder.shtml

Screenshot

Image

Status: Portable, writes no settings
