In short, having a consistent set of descriptive, clearly named files with a variety of content could serve as a way to:
- Test claims of compatibility with publicly available examples. I'm constantly finding issues with files at work that I can't share. If I could find a public repository of similar files and point developers to it, that would take down a barrier to use.
- General bug testing - Give developers the tools to check for bugs that would normally only appear after weeks or months of use.
- Benchmarking - Could be used to test compression / transfer tools thanks to a wide variety of data.
So where's this resource?
Unfortunately, it wasn't long into working on this that I realized I'd bitten off WAY more than I could chew. I could easily spend 10 hours a week for the next year and only take a bite out of what would be a very useful set of testing resources.
Also, it's entirely possible someone's already built something along these lines. In any case, I'm posting it here in case someone else can give suggestions on how to proceed, point me to someone who's already doing it, or take the idea and run with it. I may still pick this project up and move forward with it if there's time and interest, but for now I'll just leave this here.
Potentially useful file types
Various data that could help in testing.
- Text files that are slightly different (minor word differences), somewhat different (moved data), and very different with only a few similarities. Also: content that's dramatically rearranged (all paragraphs in a different order)
- Wide variety of audio formats with numerous tags applied
- Wide variety of new and old video files
- Photographs with a variety of metadata
- Files that are misnamed or mislabeled with incorrect content (should have a set that are just completely wrong names)
- Old and new versions of the same filetype (e.g. Word 6.0, Word 2003, Word 2016)
- Legacy filetypes (e.g. Lotus 1-2-3 and BMP files)
- Unusual compression formats (e.g. .xz)
- Checksum files for a set folder/directory (e.g. .SFV)
- High compression and low compression files
- Password-protected files (old doc files, docx, PDF, etc.)
- Write-protected files
- HTML generated by various tools (e.g. Word, Dreamweaver, Kompozer, LibreOffice, etc.)
- Text files with various types of generated data, including phone numbers, SSNs, and addresses
- PDFs - various formats, with layered graphics, without, hidden, etc.
- Non-Latin, Asian characters, etc.
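As a rough illustration of the first item in the list above (text files at varying degrees of difference), here's a minimal Python sketch of how such variants might be generated and compared. The function names, the sample paragraphs, and the word-level similarity measure are all assumptions made for illustration; none of this exists in the project yet.

```python
import difflib

# Hypothetical sample content; a real set would use longer, more varied text.
PARAGRAPHS = [
    "The quick brown fox jumps over the lazy dog near the quiet river bank.",
    "Archive formats change often, and older tools rarely keep up with them.",
    "Metadata can reveal when, where, and how a given file was created.",
]


def make_variants(paragraphs):
    """Return the original text plus two derived variants of it."""
    original = "\n\n".join(paragraphs)

    # Slightly different: replace a single word in the first paragraph.
    words = paragraphs[0].split()
    words[1] = "slow"
    slightly = "\n\n".join([" ".join(words)] + list(paragraphs[1:]))

    # Rearranged: identical paragraphs, reversed order.
    rearranged = "\n\n".join(reversed(paragraphs))

    return {"original": original, "slightly": slightly, "rearranged": rearranged}


def similarity(a, b):
    """Word-level similarity between 0.0 and 1.0, using difflib."""
    matcher = difflib.SequenceMatcher(None, a.split(), b.split(), autojunk=False)
    return matcher.ratio()


if __name__ == "__main__":
    variants = make_variants(PARAGRAPHS)
    for name, text in variants.items():
        print(name, round(similarity(variants["original"], text), 3))
```

The point of scoring the variants is that each test file could then be labeled with roughly how far it sits from the original, which is the kind of detail a diff or deduplication tool would want to be tested against.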
Some of the things we'd try to include in every file (where possible)
- Some standard text explaining where it came from and why (maybe with the project intro)
- Some note about what the file is and where it was used, and why it might be interesting to test (this would probably be the meat of the effort)
- A block of unique information (probably a paragraph of some generated text content)
Other possible benefits
Some other ways this resource could come in handy:
- Check whether indexing search programs are able to find a given file.
- A lot of commercial software drops anything resembling support for old versions of its own files (MS Works is the worst about this).
- Compression tools get better over time, sure - but what about testing a variety of file types with your compression program and seeing how it's improved?
- Developers frequently add some kind of toolset to open or modify a given file type and don't really know how to highlight this fact. Yet it can be a lifesaver for the right person.
- Create screencaps that feature our site
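The compression point in the list above could be tried out with something as small as this Python sketch, which compares three standard-library compressors on two kinds of input. The function name and the sample inputs are placeholders I made up; a real run would feed in files from the test set instead.

```python
import bz2
import lzma
import os
import zlib


def compression_ratios(data):
    """Map each compressor name to compressed size / original size."""
    compressors = {
        "zlib": zlib.compress,
        "bz2": bz2.compress,
        "lzma": lzma.compress,
    }
    return {name: len(fn(data)) / len(data) for name, fn in compressors.items()}


if __name__ == "__main__":
    # Placeholder inputs: one highly compressible, one essentially incompressible.
    samples = {
        "repetitive text": b"hello world " * 1000,
        "random bytes": os.urandom(12000),
    }
    for label, data in samples.items():
        for name, ratio in compression_ratios(data).items():
            print(f"{label:16} {name:5} {ratio:.3f}")
```

Running the same comparison against a fixed, public set of files with different versions of a tool is what would show whether it has actually improved.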
A great example of testing resources: all the important details are right in the file name: http://download.opencontent.netflix.com ... mera/AVIF/