[Resolved] Finding duplicate phrases

webfork
Posts: 7792
Joined: Wed Apr 11, 2007 8:06 pm
Location: US, Texas

[Resolved] Finding duplicate phrases

#1 Post by webfork » Mon Apr 24, 2017 6:57 pm

Problem: I've been working on some really long docs written by a bunch of different people, and I keep running across very similar text over and over again. I wanted a more automatic way to track this down and found a few tricks. Years back, someone posted about a program that analyzes text files for word frequency. It turns out that NoteTab will also do this (from the menu, select Tools - Text Statistics - Word Frequency), but what I really need is to find repeated phrases. For example, if "laser focused on outcomes" comes up 5x in a short document, you know you need an edit.

Does anyone know of a program for this?

A few programs came close:

* Text Deduplicator Plus - Looks portable but only checks lines rather than phrases
* A LibreOffice/OpenOffice trick with a similar limitation.

The various Word VB scripts out there (like this one) sadly aren't working for me.
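
For anyone who wants to experiment before settling on a tool, the phrase-frequency idea described above can be sketched in a few lines of Python. This is only a minimal sketch under assumptions: the function name `repeated_phrases`, the fixed phrase length `n`, and the tokenizing rule are all made up for illustration and are not taken from any program mentioned in this thread.

```python
import re
from collections import Counter


def repeated_phrases(text, n=4, min_count=2):
    """Return (phrase, count) pairs for n-word phrases seen at least min_count times."""
    # Lowercase and keep only word-like tokens so punctuation and case
    # differences do not hide a repeat.
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Slide an n-word window across the token list.
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    counts = Counter(grams)
    repeats = [(p, c) for p, c in counts.items() if c >= min_count]
    # Most frequent first; ties broken alphabetically for stable output.
    return sorted(repeats, key=lambda pc: (-pc[1], pc[0]))


if __name__ == "__main__":
    sample = (
        "We stay laser focused on outcomes. The team is laser focused on "
        "outcomes, and leadership is laser focused on outcomes too."
    )
    for phrase, count in repeated_phrases(sample, n=4, min_count=3):
        print(count, phrase)
```

Because the window length is fixed, a long repeated passage shows up as several overlapping entries. Running it with a few different values of `n` is a crude but workable way around that.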
Supporting Net Neutrality - BattleForTheNet | Why this matters | More from EFF.org

Midas
Posts: 4260
Joined: Mon Dec 07, 2009 7:09 am
Location: Sol3

Re: Finding duplicate phrases

#2 Post by Midas » Tue Apr 25, 2017 3:59 am

From where I stand, what you're looking for is a specialized kind of software generally called a concordancer (https://en.wikipedia.org/wiki/Concordancer). A decade back I would have had some ready suggestions for you, but too much time has passed since.

Nevertheless, I hazily recall that you could feed Word for Windows a text list of expressions (one per line) and it would automark every occurrence in a given document in order for an index to be generated...
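
To give a rough idea of what a concordancer produces, here is a minimal keyword-in-context sketch in Python. It is an illustration only, not how any particular concordance tool works: the function name `kwic` and the `width` parameter are assumptions, and real tools add sorting, tokenization, and statistics on top of this.

```python
import re


def kwic(text, phrase, width=30):
    """Return each occurrence of phrase with up to `width` characters of context per side."""
    # Collapse line breaks so each result stays on one line.
    flat = " ".join(text.split())
    rows = []
    # re.escape treats the phrase as literal text; IGNORECASE also
    # catches capitalized variants.
    for m in re.finditer(re.escape(phrase), flat, flags=re.IGNORECASE):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        # Right-align the left context so the matches line up in a column.
        rows.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return rows


if __name__ == "__main__":
    doc = "We are laser focused on outcomes.\nOur partners are Laser Focused on outcomes as well."
    for row in kwic(doc, "laser focused", width=20):
        print(row)
```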

__philippe
Posts: 483
Joined: Wed Jun 26, 2013 2:09 am

Re: Finding duplicate phrases

#3 Post by __philippe » Tue Apr 25, 2017 5:51 am

Selection of free concordance tools offered by Yatsko's Computational Linguistics Laboratory:

Intro:
http://yatsko.zohosites.com/about-us.html

Tools:
http://yatsko.zohosites.com/products.html


Re: Finding duplicate phrases

#4 Post by webfork » Sat Jun 30, 2018 6:46 am

Thanks Midas and __philippe for these very useful suggestions.

---

This topic took me some time to get back to, as I had a bunch of document research and then none for over a year. Anyway, adTAT is a great intro tool and is portable.

Steps:

1. Download the installer and extract its contents using 7-Zip
2. Launch adTAT.exe

Portable: yes, saves no settings. Stealth: untested

Uses: text files only, but can open any number of text files for expansive research.

Requires: Java

You can already see it picking out some patterns from a license document on my machine:

Image

Re: Finding duplicate phrases

#5 Post by webfork » Sun Jul 08, 2018 11:12 am

This may not fit under the umbrella of duplicate phrasing, but it's a related text analysis program that's super simple and has broad appeal now that we're all drowning in information.

HR Automation Tool (HRAT)

Background: resume screening was one of the first uses of document analysis I ever heard about, where companies would dig through a host of documents looking for a few clear keywords. As the program description points out, this can be used for things other than resume analysis, such as:
  • Comparing white papers for relevance
  • Checking whether proposals actually discuss specific service requirements
  • Looking for specific and important terms
Features
  • Works on entire folders and their subfolders
  • Accepts a list of words and then provides results of this collection of terms.
Limitations
  • Somewhat buggy. Had some difficulty selecting the word list option ("Unlimited Skill Set Analysis Mode").
  • This was developed roughly 6 years ago and may or may not function with recent Microsoft Office / PDF version requirements.
  • You're going to want to export the results to Excel. The internal analysis tools are fairly limited.
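
The general approach (scan a folder tree for a list of terms, then hand the counts to a spreadsheet) can be sketched in Python as below. This is not HRAT's code or file handling. It assumes plain `.txt` files in UTF-8, and the names `count_terms` and `write_csv` are made up for this example.

```python
import csv
import re
from pathlib import Path


def count_terms(folder, terms):
    """Map each .txt file under folder (subfolders included) to per-term counts."""
    # Whole-word, case-insensitive match for each term. Terms that begin or
    # end with punctuation (e.g. "C++") would need a different boundary rule.
    patterns = {
        t: re.compile(r"\b" + re.escape(t) + r"\b", re.IGNORECASE) for t in terms
    }
    results = {}
    for path in sorted(Path(folder).rglob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        results[str(path)] = {t: len(p.findall(text)) for t, p in patterns.items()}
    return results


def write_csv(results, terms, out_path):
    """Write one row per file and one column per term; opens directly in Excel."""
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["file", *terms])
        for name, counts in results.items():
            writer.writerow([name, *(counts[t] for t in terms)])


if __name__ == "__main__":
    wanted = ["java", "project management"]  # hypothetical word list
    write_csv(count_terms(".", wanted), wanted, "term_counts.csv")
```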
Screenshot

Image

Example output (spreadsheet)

Image

Website

https://sourceforge.net/projects/hrat/
http://www.softpedia.com/get/Office-too ... yzer.shtml (no idea why it has a different name on Softpedia)

Portability

Portable, requires Java

Steps:

1. Download and install to the default location
2. Modify the HRAutomationTool.ini file to change the following lines:

For the line that starts with "Class Path", replace with: Class Path=.\CVA_Data\CVA.jar;
For the line that starts with "Splash Screen", replace with: Splash Screen=.\CVA_Data\splash_Full.jpg

3. Move the program folder to a location of your choice and launch HRAutomationTool.exe
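
If editing the file by hand gets tedious, the two line replacements in step 2 could be scripted along these lines. This is a sketch under assumptions: it treats the .ini as plain text with one setting per line, reads it as Latin-1 so that unknown bytes survive the round trip, and the function names are invented here. The replacement values are the ones quoted in step 2.

```python
from pathlib import Path

# Replacement lines quoted from step 2 above, keyed by the line prefix to match.
REPLACEMENTS = {
    "Class Path": r"Class Path=.\CVA_Data\CVA.jar;",
    "Splash Screen": r"Splash Screen=.\CVA_Data\splash_Full.jpg",
}


def patch_ini_text(text):
    """Return text with any line starting with a known prefix swapped out."""
    out = []
    for line in text.splitlines():
        for prefix, new_line in REPLACEMENTS.items():
            if line.startswith(prefix):
                line = new_line
                break
        out.append(line)
    return "\n".join(out) + "\n"


def patch_ini_file(path):
    """Patch the file in place, keeping the original next to it with a .bak suffix."""
    p = Path(path)
    # Assumption: Latin-1 maps every byte to a character, so nothing is lost.
    original = p.read_text(encoding="latin-1")
    p.with_name(p.name + ".bak").write_text(original, encoding="latin-1")
    p.write_text(patch_ini_text(original), encoding="latin-1")
```

Calling `patch_ini_file("HRAutomationTool.ini")` from the install folder would apply both changes and leave a backup copy behind.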

Re: Finding duplicate phrases [Resolved]

#6 Post by webfork » Sun Jul 08, 2018 12:44 pm

Resolved

In digging around for some of the posts, I solved the duplicate phrases question. Sadly, it was a program I overlooked many years ago:

MatnPardaz - Free Word (and phrase) Frequency Counter

Note that I updated the thread topic to include "and phrase". If that had been included initially, I might have found that program sooner.

I still have a lot of work ahead of me looking at the various recommended Concordancer tools, which are a little more intelligent than the MatnPardaz program.

Re: [Resolved] Finding duplicate phrases

#7 Post by Midas » Sun Jul 08, 2018 1:41 pm

FTR, dtSearch (https://www.dtsearch.com/) was the best commercial product I have ever tested for all around phrase searching -- when it was still shareware... I believe that if you're a developer, you can request a fully functional evaluation copy at their website.

Re: [Resolved] Finding duplicate phrases

#8 Post by webfork » Sun Jul 08, 2018 2:59 pm

Midas wrote:
Sun Jul 08, 2018 1:41 pm
FTR, dtSearch (https://www.dtsearch.com/) was the best commercial product I have ever tested for all around phrase searching -- when it was still shareware... I believe that if you're a developer, you can request a fully functional evaluation copy at their website.
Oh wow ... just the hit highlighting might make my month.

Thanks for that. Again :)