[Resolved] Finding duplicate phrases

webfork
Posts: 8235
Joined: Wed Apr 11, 2007 8:06 pm
Location: US, Texas

#1 Post by webfork » Mon Apr 24, 2017 6:57 pm

Problem: I've been working on some really long docs written by a bunch of different people, and I keep running across very similar text over and over again. I wanted a more automatic way to track this down and found a few tricks. Years back someone posted about a program that analyzes text files for word frequency. It turns out that NoteTab will also do this (from the menu, select Tools - Text Statistics - Word Frequency), but what I really need is to find repeated phrases. For example, if "laser focused on outcomes" comes up 5x in a short document, you know you need an edit.

Does anyone know of a program for this?

A few programs came close:

* Text Deduplicator Plus - Looks portable but only checks lines rather than phrases
* A LibreOffice/OpenOffice trick with a similar limitation.

The various Word VB scripts out there (like this one) sadly aren't working for me.
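
For anyone comfortable with a little scripting, the phrase-frequency idea described above can be roughly sketched in Python. This is only a minimal illustration, not how any of the programs in this thread work: the function name `repeated_phrases`, the phrase length `n`, and the `min_count` threshold are all made up for the example, and it assumes plain text input.

```python
import re
from collections import Counter


def repeated_phrases(text, n=4, min_count=2):
    """Return (phrase, count) pairs for every n-word phrase seen at least min_count times."""
    # Lowercase and keep only word-like tokens so case and punctuation do not split matches.
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Slide a window of n words across the text and count each window.
    ngrams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    counts = Counter(ngrams)
    return [(phrase, count) for phrase, count in counts.most_common() if count >= min_count]


sample = (
    "We are laser focused on outcomes. The team stays laser focused on outcomes, "
    "and leadership is laser focused on outcomes too."
)
for phrase, count in repeated_phrases(sample, n=4, min_count=2):
    print(count, phrase)
```

Running it for several values of `n` (say 3 through 8) would surface both short and long repeats, at the cost of overlapping results.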

Midas
Posts: 4509
Joined: Mon Dec 07, 2009 7:09 am
Location: Sol3

Re: Finding duplicate phrases

#2 Post by Midas » Tue Apr 25, 2017 3:59 am

From where I stand, what you're looking for is a specialized kind of software generally called a concordancer (https://en.wikipedia.org/wiki/Concordancer). A decade back I would have had some ready suggestions for you, but too much time has passed since.

Nevertheless, I hazily recall that you could feed Word for Windows a text list of expressions (one per line) and it would automark every occurrence in a given document in order for an index to be generated...
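
To give an idea of what a concordancer does at its simplest, here is a hedged Python sketch of a keyword-in-context (KWIC) listing. It is an illustration only and not taken from any particular tool; the function name `kwic` and the `width` parameter are invented for the example.

```python
import re


def kwic(text, phrase, width=30):
    """Return each occurrence of phrase with up to `width` characters of context per side."""
    lines = []
    for match in re.finditer(re.escape(phrase), text, flags=re.IGNORECASE):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the matched phrase lines up in one column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


for line in kwic("The quick brown fox. The lazy dog.", "the", width=5):
    print(line)
```

Real concordancers add sorting, collocation statistics, and multi-file corpora on top of this basic view.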

__philippe
Posts: 554
Joined: Wed Jun 26, 2013 2:09 am

Re: Finding duplicate phrases

#3 Post by __philippe » Tue Apr 25, 2017 5:51 am

A selection of free concordance tools offered by Yatsko's Computational Linguistics Laboratory:

Intro:
http://yatsko.zohosites.com/about-us.html

Tools:
http://yatsko.zohosites.com/products.html

Re: Finding duplicate phrases

#4 Post by webfork » Sat Jun 30, 2018 6:46 am

Thanks Midas and __philippe for these very useful suggestions.

---

This topic took me some time to get back to, as I had a bunch of document research and then none for over a year. Anyway, adTAT is a great intro tool and is portable.

Steps:

1. Download the installer and extract its contents using 7-Zip
2. Launch adTAT.exe

Portable: yes, saves no settings. Stealth: untested

Uses: text files only, but can open any number of text files for expansive research.

Requires: Java

You can already see it picking out some patterns from a license document on my machine:

Image

Re: Finding duplicate phrases

#5 Post by webfork » Sun Jul 08, 2018 11:12 am

This may not fit under the umbrella of duplicate phrasing, but it's a related text analysis program that's super simple and with broad appeal when we're all drowning in information.

HR Automation Tool (HRAT)

Background: resume screening was one of the first uses of document analysis I ever heard about, where companies would dig through a host of documents looking for a few clear keywords. As the program description points out, this can be used for things other than resume analysis, such as:
  • Comparing white papers for relevance
  • Checking if proposals actually discuss specific service requirements
  • Looking for specific and important terms

Features
  • Works on entire folders and their subfolders
  • Accepts a list of words and reports results for that collection of terms

Limitations
  • Somewhat buggy. Had some difficulty selecting the word list option ("Unlimited Skill Set Analysis Mode").
  • This was developed roughly 6 years ago and may or may not work with recent Microsoft Office and PDF file versions.
  • You're going to want to export the results to Excel. The internal analysis tools are fairly limited.
  • EDIT: Doesn't support DOCX files (I believe development stopped a little before this format was ready). LibreOffice has a number of easy batch tools to convert from DOCX to either the earlier DOC format or TXT, so that's annoying but not impossible to resolve.
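
For comparison, the general idea behind the features above (walk a folder and its subfolders, count a list of terms, export for a spreadsheet) can be sketched in Python. This is not HRAT's actual code, just a rough illustration under some assumptions: it only reads plain `.txt` files, and the names `count_terms` and `write_csv` are hypothetical.

```python
import csv
import re
from pathlib import Path


def count_terms(folder, terms, pattern="*.txt"):
    """Count whole-word occurrences of each term in every matching file under folder."""
    rows = []
    for path in sorted(Path(folder).rglob(pattern)):  # rglob also walks subfolders
        text = path.read_text(encoding="utf-8", errors="replace").lower()
        row = {"file": str(path)}
        for term in terms:
            # \b keeps "java" from matching inside "javascript".
            row[term] = len(re.findall(rf"\b{re.escape(term.lower())}\b", text))
        rows.append(row)
    return rows


def write_csv(rows, terms, out_path):
    """Write one row per file so the results can be opened in a spreadsheet."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["file", *terms])
        writer.writeheader()
        writer.writerows(rows)
```

Something like `write_csv(count_terms("docs", ["sla", "uptime"]), ["sla", "uptime"], "results.csv")` would produce a file Excel or LibreOffice Calc can open. The word-boundary match is a simplification and will miss terms that end in punctuation, such as "C++".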

Screenshot

Image

Example output (spreadsheet)

Image

Website

https://sourceforge.net/projects/hrat/
http://www.softpedia.com/get/Office-too ... yzer.shtml (no idea why it has a different name on Softpedia)

Portability

Portable, requires Java

Steps:

1. Download and install to the default location
2. Modify the HRAutomationTool.ini file to change the following lines:

For the line that starts with "Class Path", replace with: Class Path=.\CVA_Data\CVA.jar;
For the line that starts with "Splash Screen", replace with: Splash Screen=.\CVA_Data\splash_Full.jpg

3. Move to a folder of your choice and launch HRAutomationTool.exe
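
If you would rather script the edit in step 2 than do it by hand, something like the following Python sketch should work. Treat it as an untested assumption about the file: it simply swaps any line starting with "Class Path" or "Splash Screen" for the relative-path versions given above, and the function name `make_paths_relative` is made up for the example.

```python
def make_paths_relative(ini_text):
    """Replace the "Class Path" and "Splash Screen" lines with relative-path versions."""
    replacements = {
        "Class Path": r"Class Path=.\CVA_Data\CVA.jar;",
        "Splash Screen": r"Splash Screen=.\CVA_Data\splash_Full.jpg",
    }
    patched = []
    for line in ini_text.splitlines():
        for prefix, new_line in replacements.items():
            if line.startswith(prefix):
                line = new_line
                break
        patched.append(line)
    return "\n".join(patched) + "\n"


# Example use (keep a backup copy of the .ini first):
# from pathlib import Path
# ini = Path("HRAutomationTool.ini")
# ini.write_text(make_paths_relative(ini.read_text()))
```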

Re: Finding duplicate phrases [Resolved]

#6 Post by webfork » Sun Jul 08, 2018 12:44 pm

Resolved

In digging around for some of the posts, I solved the duplicate phrases question. Sadly, it was a program I overlooked many years ago:

MatnPardaz - Free Word (and phrase) Frequency Counter

Note that I updated the thread topic to include "and phrase". If that had been included initially, I might have found that program sooner.

I still have a lot of work ahead of me looking at the various recommended Concordancer tools, which are a little more intelligent than the MatnPardaz program.

Re: [Resolved] Finding duplicate phrases

#7 Post by Midas » Sun Jul 08, 2018 1:41 pm

FTR, dtSearch (https://www.dtsearch.com/) was the best commercial product I have ever tested for all around phrase searching -- when it was still shareware... I believe that if you're a developer, you can request a fully functional evaluation copy at their website.

Re: [Resolved] Finding duplicate phrases

#8 Post by webfork » Sun Jul 08, 2018 2:59 pm

Midas wrote (Sun Jul 08, 2018 1:41 pm):
"FTR, dtSearch (https://www.dtsearch.com/) was the best commercial product I have ever tested for all around phrase searching -- when it was still shareware... I believe that if you're a developer, you can request a fully functional evaluation copy at their website."

Oh wow ... just the hit highlighting might make my month.

Thanks for that. Again :)

Re: [Resolved] Finding duplicate phrases

#9 Post by webfork » Mon Jan 28, 2019 7:15 pm

Adding this to the concordance thread, just because I don't suspect there's a wide audience ...

---

WordStatix is a basic document / word analysis tool with some nice extras:

Steps: Download the installation file and extract it with UniExtract2

Resources: RAM: 4.7 MB, Disk: 4.6 MB (less if you delete the PDF manual)

Status: On the fence. It writes mostly incidental settings to appdata. Ignoring those values it might be acceptable.

License: GPL

Sites:
Related links:
