CLI Database Discussions

Discuss anything related to command line tools here.
vevy
Posts: 228
Joined: Tue Sep 10, 2019 11:17 am

Re: CLI Database Discussions

#181 Post by vevy » Thu May 21, 2020 7:10 pm

Thanks a lot for your response. It is clear and to the point, which I appreciate. You should try to poke as many holes as you can in what I propose so that we all (including myself) can see if it can hold.

A few quick remarks about alternativeto.net:
  • You are a bit harsher on it than I would be! :) From a user's perspective, it is my go-to place for "related" software.
  • Whether it's the tags or some algorithm behind the scenes, their related software works more often than not for me, especially when tried in multiple iterations (related, then related of related). I discovered a lot of gems through that system.
  • Also, their tags largely use unified terminology, meaning only one phrasing is used consistently as the tag for a particular feature.
  • Our "Similar/alternative apps" is the kind of feature that requires a whole lot of maintenance and manual work beyond occasional use, which is what I want to avoid through systematic relations. It also works for largely overlapping apps rather than intersecting ones. It is also kind of cryptic (in what way are they similar?) and subjective about what level of similarity warrants it.
Andrew Lee wrote:
Thu May 21, 2020 3:50 pm
... tries to anticipate all potential search queries.
..."extract video from mpeg without quality loss". We need an index for that.
It's never-ending manual work to maintain the index...
If that is what comes across, I realize now I should have been clearer. Please bear with me as I try to explain the system that I have in mind:
  • My target audience is someone like me (and, I believe, many tech-minded people): interested in finding a piece of software that can do a certain job, but who also wants their pick of the litter rather than installing the first "youtube mp3 downloader" they find on Google. I want to make it easier to find these tools.
  • I also expect that user to be able to do a little bit of the work themselves.
  • "extract video from mpeg without quality loss" is something I would never bother adding even if I had the time:
    1. It is not one use case. It is multiple. A tool that does that would be tagged: ("transcode video" = "convert video to video"), ("mux video" = "convert video to video losslessly"), ("video to mpeg"), ("convert to mpeg", which would also cover images to mpeg), ("avi to mpeg"), ("mp4 to mpeg"), ("flv to mpeg"), etc.
    2. As I expect the user to have a minimum of tech sense, and as they are searching for CLI tools on our site, I would not plan for (nor care to prepare for) them using a phrase like "extract video from mpeg"; it's not about getting a part from a whole. I would expect something like "convert" at the very least.
    3. See the 2 main reasons for aliases in my previous post, but here, as the user searches for "convert video to mpeg without quality loss", the search engine should pick matches like "convert", "mpeg" (or "mpg"), "loss" and show the relevant tags at the top (my suggestion), and also the entry results with the most relevant tags.
    4. I want to cover the common terms/phrases, not every unique version. I just want to help the user described above find the proverbial thread.
    5. The format variations, like "avi to mpeg" would be in the dozens for all-in-one tools like ffmpeg, but I believe that the problem they may pose is not mainly the effort to add them, but the presentation; i.e. hiding them in lists, but showing them if they have a search hit, in the entry's page, under a "more" button, whatever it may be.
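
As a rough illustration of points 1 and 5 above, here is a minimal Python sketch of alias resolution and of hiding format-pair tags unless a search hits them. The table names, the "general"/"formats" split, and the helper functions are all hypothetical and chosen for the example; this is not the site's actual schema.

```python
# Hypothetical alias table: each alternative phrasing maps to one canonical tag.
ALIASES = {
    "transcode video": "convert video to video",
    "mux video": "convert video to video losslessly",
    "convert to mpg": "convert to mpeg",
}

# Hypothetical entry: general tags plus format-pair tags
# (the latter could run into the dozens for a tool like ffmpeg).
FFMPEG_TAGS = {
    "general": [
        "convert video to video",
        "convert video to video losslessly",
        "convert to mpeg",
    ],
    "formats": ["avi to mpeg", "mp4 to mpeg", "flv to mpeg"],
}


def canonical(tag):
    """Resolve an alias to its canonical tag; unknown phrases pass through."""
    tag = tag.strip().lower()
    return ALIASES.get(tag, tag)


def visible_tags(entry_tags, query=""):
    """General tags are always listed; format tags only when a query word hits them."""
    query_words = set(query.lower().split())
    hits = [
        tag
        for tag in entry_tags["formats"]
        if query_words & (set(tag.split()) - {"to"})
    ]
    return entry_tags["general"] + hits
```

With no query, only the three general tags would be listed; a query containing "flv" would additionally surface "flv to mpeg", which is roughly the "show them if they have a search hit" behaviour described above.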
Andrew Lee wrote:
if something changes, some entries break and need to be verified/maintained.
Just curious, how do you see that happening? It is not a rhetorical question. I know things can break. I just want to know what you have in mind.
Andrew Lee wrote:
You can argue the index doesn't need to be complete to be useful. But the fact is, if it's only 20%~30% complete, it won't be very useful. And that 20% to 30% is already a _ton_ of work.
Other than it not being an index of tags as I visualize it, I guess I am aiming at 80-90 percent by breaking things into manageable units rather than doing all the variations of tags together. I don't know how our search engine works, but if it indexes keywords from the database or performs direct search, then weighs the results, it will be good enough, I think. I do believe it should do partial word matches though. :) That will make things easier (like with "losslessly").
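
The kind of search described here (keyword matching, partial word matches, weighted results) could look something like the following Python sketch. The index contents, the stopword list, and the 2/1 weights are assumptions made for illustration; nothing here reflects how the site's search engine actually works.

```python
import re

# Hypothetical index: entry name -> tags (illustrative data, not the real database).
INDEX = {
    "ffmpeg": ["convert video to video losslessly", "convert to mpeg"],
    "imagetool": ["convert image to image"],
}

STOPWORDS = {"to", "from", "without", "the", "a"}


def keywords(text):
    """Lowercase words with stopwords removed."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]


def word_weight(query_word, tag_word):
    """2 for an exact match, 1 for a partial (prefix) match, else 0."""
    if query_word == tag_word:
        return 2
    if tag_word.startswith(query_word):
        return 1
    return 0


def score(query, tags):
    """Sum, over query keywords, of the best weight against any tag word."""
    tag_words = {w for tag in tags for w in keywords(tag)}
    return sum(
        max((word_weight(q, t) for t in tag_words), default=0)
        for q in keywords(query)
    )


def search(query):
    """Names of entries with a non-zero score, best first."""
    scored = [(score(query, tags), name) for name, tags in INDEX.items()]
    return [name for s, name in sorted(scored, key=lambda p: (-p[0], p[1])) if s > 0]
```

For the query "convert video to mpeg without quality loss", the word "loss" only partially matches "losslessly", so it contributes less than the exact hits on "convert", "video" and "mpeg", but it still helps rank the hypothetical ffmpeg entry above the image tool.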

Andrew Lee wrote:
What you have in mind could potentially work if the search domain is small, static and exhaustively maintained. That is why I brought up expert systems in my previous post. It turned out to have very limited application precisely because of that. Very few real world applications fall into this category. New information comes in all the time, and it became very costly and laborious for expert systems to remain updated and relevant.
I have to say I do believe that CLI tools fall comfortably enough within what you describe. The pool for Windows is comparatively small and relatively stable (if not fully static). See post #175 above for why I believe that. I wouldn't tackle such a project with even 10% of Softpedia's database for example.

Andrew Lee wrote:
"This idea will never scale."
I agree in part :) . See my previous point. But also, the categories and common use cases don't grow that much with time. The tools may. If we did the framework systematically and adopted a DRY approach (for example, see the last few paragraphs of my post #175 above), changing things en masse should be fairly easy or at least manageable in the vast majority of cases.
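
One possible reading of the DRY idea, sketched in Python: format lists are defined once and entries refer to them by name, so a single change propagates to every entry. Post #175 is not reproduced here, so the group names, the entry layout, and the `expand_tags` helper are purely hypothetical.

```python
# Hypothetical shared definitions: each format group is written once.
FORMAT_GROUPS = {
    "video-in": ["avi", "mp4", "flv"],
}

# Hypothetical entries reference a group by name instead of repeating pair tags.
ENTRIES = {
    "ffmpeg": {"from": "video-in", "to": ["mpeg"]},
    "othertool": {"from": "video-in", "to": ["mpeg", "mp4"]},
}


def expand_tags(entry):
    """Derive the 'X to Y' tags for an entry from the shared groups."""
    sources = FORMAT_GROUPS[entry["from"]]
    return [
        f"{src} to {dst}"
        for src in sources
        for dst in entry["to"]
        if src != dst  # skip no-op pairs such as "mp4 to mp4"
    ]
```

Under this layout, removing "flv" from the shared group would drop every "flv to ..." tag from both entries at once, with no per-entry edits.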
Andrew Lee wrote:
I think a more fruitful approach will be a better search engine.
I would actually like that. I mean, I am invested in this project and I wouldn't presume to tell you what to do with your effort on your site, but I wouldn't say no to both :mrgreen:.

Wall of text over!
I do NOT have other accounts.

Andrew Lee
Posts: 2355
Joined: Sat Feb 04, 2006 9:19 am

Re: CLI Database Discussions

#182 Post by Andrew Lee » Fri May 22, 2020 9:06 pm

About "alternativeto.net", I actually went a little further this morning, signed up for an account and poked around. If any of you have any insider information, I am all ears. But here are some of my thoughts:

- I don't think what you see is 100% crowd-sourced. There's definitely some kind of algo behind the scenes that takes user input and text/tag analysis to produce the clustering of similar software.

- Same with the tags. What you input is not immediately accepted, but goes into the backend for processing, probably with a combination of manual input and algorithmic processing. It's a blackbox as far as I can tell, not some transparent moderator approval process.

- Going by the activity in their forum, I'd be very surprised if crowd-sourced data forms the majority of their input. I am guessing data-scraping and text analysis play a bigger role.

- - - - - - - - - -

@vevy: I am very confused about 2 points from your arguments that keep coming up.

1. From your description of the use-cases that you would assign to a tool, it seems precisely the kind of micro-management that others and I feel will never be feasible. Yet you argue it's not, while continuing to cite examples that appear to contradict that view. Very confusing! Maybe if you could exhaustively list all the "hundreds" of use-cases that you would assign to just _one_ tool, e.g. ffmpeg, we could have a better basis for further discussion.

2. It seems that you think the database/tags for the CLI tools will be small and somewhat static. Have you considered that a dynamic database with fully crowd-sourced data and a complicated approval process, like TPFC, is not a good fit for the requirement? IMHO, a simple Microsoft Access-like database edited by one or two people would be a better-fitting tool.

- - - - - - - - - -
vevy wrote:
Just curious, how do you see that happening? It is not a rhetorical question. I know things can break. I just want to know what you have in mind.
A common one would be format/protocol changes/removal due to patent/security issues. Of course, you could argue that such changes don't occur very often, or it would be an easy update, or surely we don't have to put said format/protocol into the use-case etc. But that would be missing the point.

(Some examples that come to mind would be certain patented image formats like GIF, JPEG2000 etc., or crypto algorithms that used to be supported by SSH but have since been deprecated.)

I truly see a lot of similarities with expert systems (disclaimer: I used to be a research student back in the day). The solution to any problem would be "let's add more rules", or "let's change some rules", ad infinitum. No one ever steps back and asks, "Is this the right tool for the job?" :D
vevy wrote:
I would actually like that. I mean, I am invested in this project and I wouldn't presume to tell you what to do with your effort on your site, but I wouldn't say no to both
Now if only one of the tech giants would contact me and pour some funding into my bank account, I would be glad to assemble a team to tackle the research and implementation 8) Meanwhile, the cheaper alternative is to use the Google custom search engine!
