csvtk - command-line CSV/TSV Tool Kit for manipulating table data (instead of spreadsheet)

Submit command line tools that you find here.
Post Reply
Message
Author
User avatar
rbon
Posts: 344
Joined: Wed Mar 28, 2007 10:16 am
Location: Italy

csvtk - command-line CSV/TSV Tool Kit for manipulating table data (instead of spreadsheet)

#1 Post by rbon »

csvtk (CSV Tool Kit) - A cross-platform, efficient and practical CSV/TSV toolkit. It's a command-line tool.
Current release is 0.21.0 of October 30, 2020.

• Documents: http://bioinf.shenwei.me/csvtk/
• Usage: http://bioinf.shenwei.me/csvtk/usage/
• Tutorial: http://bioinf.shenwei.me/csvtk/tutorial/

From author site on GitHub:
Similar to FASTA/Q format in field of Bioinformatics, CSV/TSV formats are basic and ubiquitous file formats in both Bioinformatics and data science.
People usually use spreadsheet software like MS Excel to process table data. However this is all by clicking and typing, which is not automated and is time-consuming to repeat, especially when you want to apply similar operations with different datasets or purposes.
You can also accomplish some CSV/TSV manipulations using shell commands, but more code is needed to handle the header line. Shell commands do not support selecting columns with column names either.
csvtk is convenient for rapid data investigation and also easy to integrate into analysis pipelines. It could save you lots of time in (not) writing Python/R scripts.
Features
• Cross-platform (Linux/Windows/Mac OS X/OpenBSD/FreeBSD)
• Light weight and out-of-the-box, no dependencies, no compilation, no configuration
• Fast, multiple-CPUs supported (some commands)
• Practical functions provided by N subcommands
• Support STDIN and gziped input/output file, easy being used in pipe
• Most of the subcommands support unselecting fields and fuzzy fields, e.g. -f "-id,-name" for all fields except "id" and "name", -F -f "a.*" for all fields with prefix "a."
• Support some common plots (see usage down in this page)

csvtk Help
from command-line type: csvtk -h
if you like help in a text file, type: csvtk -h>csvtk_help.txt

Subcommands
45 subcommands in total.
If you go on program page on GitHub and click on one subcommand, you you are brought to the author's page where is the command with examples of use (for example): https://bioinf.shenwei.me/csvtk/usage/#freq for the subcommand 'freq'.

Code: Select all

Information
• headers: prints headers
• dim: dimensions of CSV file
• nrow: print number of records
• ncol: print number of columns
• summary: summary statistics of selected digital fields (groupby group fields)
• watch: online monitoring and histogram of selected field
• corr: calculate Pearson correlation between numeric columns

Format conversion
• pretty: converts CSV to readable aligned table
• csv2tab: converts CSV to tabular format
• tab2csv: converts tabular format to CSV
• space2tab: converts space delimited format to CSV
• transpose: transposes CSV data
• csv2md: converts CSV to markdown format
• csv2json: converts CSV to JSON format
• xlsx2csv: converts XLSX to CSV format

Set operations
• head: prints first N records
• concat: concatenates CSV/TSV files by rows
• sample: sampling by proportion
• cut: selects parts of fields
• grep: greps data by selected fields with patterns/regular expressions
• uniq: unique data without sorting
• freq: frequencies of selected fields
• inter: intersection of multiple files
• filter: filters rows by values of selected fields with arithmetic expression
• filter2: filters rows by awk-like arithmetic/string expressions
• join: join files by selected fields (inner, left and outer join)
• split splits CSV/TSV into multiple files according to column values
• splitxlsx: splits XLSX sheet into multiple sheets according to column values
• collapse: collapses one field with selected fields as keys
• comb: compute combinations of items at every row

Edit
• add-header: add column names
• del-header: delete column names
• rename: renames column names with new names
• rename2: renames column names by regular expression
• replace: replaces data of selected fields by regular expression
• round: round float to n decimal places
• mutate: creates new columns from selected fields by regular expression
• mutate2: creates new column from selected fields by awk-like arithmetic/string expressions
• sep: separate column into multiple columns
• gather: gathers columns into key-value pairs

Ordering
• sort: sorts by selected fields

Plotting
plot see usage [url]http://bioinf.shenwei.me/csvtk/usage/#plot[/url]
• plot hist histogram
• plot box boxplot
• plot line line plot and scatter plot

Misc
• cat stream file and report progress
• version print version information and check for update
• genautocomplete generate shell autocompletion script
Links:
• csvtk - CSV/TSV Toolkit (Author's page): https://bioinf.shenwei.me/csvtk/
• GitHub project page: https://github.com/shenwei356/csvtk
• GitHub download page: https://github.com/shenwei356/csvtk/rel ... ag/v0.21.0
• direct download release 0.21.0 for win32: https://github.com/shenwei356/csvtk/rel ... exe.tar.gz
• direct download release 0.21.0 for win64: https://github.com/shenwei356/csvtk/rel ... exe.tar.gz

Note_1:
File name for win32: 'csvtk_windows_386.exe.tar.gz' size 7.95 MB - unzipped 16.711 MB
File name for win64: 'csvtk_windows_amd64.exe.tar.gz' size 8.17 MB - unzipped 18.885 MB
(you must unzip these files for obtain CLI *.exe programs)

Note_2:
the author suggest for Windows users to copy 'csvtk.exe' to 'C:\WINDOWS\system32'.
Why ? Because 'C:\WINDOWS\system32' is in system PATH and you can run program from any directory. But the easy way to add a PATH to your system is use the program 'AdvancedRun' by NirSoft https://www.portablefreeware.com/index.php?id=2734 for run (launch) 'csvtk'.
"Programming today is a race between software engineers striving to build bigger and better idiot-proof programs, and the universe trying to build bigger and better idiots. So far, the universe is winning." - Rick Cook.

User avatar
Midas
Posts: 6366
Joined: Mon Dec 07, 2009 3:09 pm
Location: Sol3

Re: csvtk - command-line CSV/TSV Tool Kit for manipulating table data (instead of spreadsheet)

#2 Post by Midas »

Quick note about Note_2 to point out that '%WINDIR%' -- wherever it resides (usually 'C:\Windows\') -- has the same advantages and is a tad less security sensitive. It also contains fewer files, normally.

User avatar
rbon
Posts: 344
Joined: Wed Mar 28, 2007 10:16 am
Location: Italy

Re: csvtk - command-line CSV/TSV Tool Kit for manipulating table data (instead of spreadsheet)

#3 Post by rbon »

Midas wrote: Fri Jan 01, 2021 5:35 pm Quick note about Note_2 to point out that '%WINDIR%' -- wherever it resides (usually 'C:\Windows\') -- has the same advantages and is a tad less security sensitive. It also contains fewer files, usually.
Thanks Midas, i didn't know.
"Programming today is a race between software engineers striving to build bigger and better idiot-proof programs, and the universe trying to build bigger and better idiots. So far, the universe is winning." - Rick Cook.

User avatar
rbon
Posts: 344
Joined: Wed Mar 28, 2007 10:16 am
Location: Italy

Re: csvtk - command-line CSV/TSV Tool Kit for manipulating table data (instead of spreadsheet)

#4 Post by rbon »

csvtk (CSV Tool Kit) rel. 0.25.0 released at September 22, 2022

Links:
• csvtk - CSV/TSV Toolkit (Author's page): https://bioinf.shenwei.me/csvtk/
• GitHub project page: https://github.com/shenwei356/csvtk
• GitHub download page: https://github.com/shenwei356/csvtk/rel ... ag/v0.25.0
• direct download release 0.25.0 for win32: https://github.com/shenwei356/csvtk/rel ... exe.tar.gz
• direct download release 0.25.0 for win64: https://github.com/shenwei356/csvtk/rel ... exe.tar.gz
• Readme.md: https://github.com/shenwei356/csvtk#readme

Changes in rel. 0.25.0
○ csvtk: report empty files.
○ csvtk join: fix loading file with no records.
○ csvtk filter2/muate2:
-->● support variable format of ${var} with special charactors including commas, spaces, and parentheses, e.g., ${a,b}, ${a b}, or ${a (b)}. #186
○ csvtk sort: fix checking non-existed fileds.
○ csvtk plot box/hist/line: new flag --skip-na for skipping missing data. #188
○ csvtk csv2xlsx: stores number as float. #192
○ csvtk summary: new functions argmin and argmax. #181

There are GUIs for this Command Line program ?
No, they don't exist and the author is against creating GUIs.
..but a user (https://github.com/shenwei356/csvtk/issues/169) has created a script in AutoHotkey to build a GUI for a specific query.

Here is the AutoHotkey script that you can call 'csvtk_query.ahk' (must be compiled to exe).

Code: Select all

; duplicate key exclusion with csvtk
; works with csv as well as csv.gz files
; csvtk should be in your environment path

#NoEnv
#SingleInstance, ignore

Gui, font, s9 bold, Tahoma

Gui, Add, Text, x21 y7 w400 r1 , Path of csv or csv.gz file:
Gui, Add, Edit, x21 w424 r2 vCsvFilePath

Gui, Add, Text, x21 w400 r1 , Duplicate Columns List (comma separated, without spaces):
Gui, Add, Edit, x21 w424 r2 vDupColsList

Gui, Add, Text, x21 w400 r1 , Unique Column:
Gui, Add, Edit, x21 w424 r2 vUniqueColumn

Gui, Add, Button, x21 w424 h28 , &Result

Gui, Show, x10 y100 w466 h234, Duplicate Key Exclusion with CSVTK
return

GuiClose:
GuiEscape:
ExitApp

; file drop support for edit box
GuiDropFiles:
	; if only one file is dropped into first edit box, set file path
	; do nothing if more than one files are dropped
	if (A_GuiControl = "CsvFilePath" and A_EventInfo = 1)
		guiControl,, CsvFilePath, %A_GuiEvent%
	
	; if file is dropped anywhere else, don't bother
return

ButtonResult:

    ; get variables from GUI
	guiControlGet, csvFile,, CsvFilePath
	guiControlGet, dupCols,, DupColsList
   	guiControlGet, uniqueCol,, UniqueColumn

    ; start processing

    IfNotExist, %csvFile%
    {
        MsgBox, Input file does not exist
        Exit
    }
        
    interimResultFile = interimResult.csv.gz

    SplitPath csvFile, inputFile, workingDir,, inputFileNoExt
    RegExMatch(inputFile, ".*(?=.csv)", inputFileNoExt) ; will work for csv as well as csv.gz files

    SetWorkingDir, %workingDir%

    ; Step 1
    ;cL = bat "%inputFile%" | csvtk uniq -f %dupCols%,%uniqueCol% | csvtk freq -f %dupCols% | csvtk filter2 -f "$frequency > 1" | csvtk cut -f -frequency -o %interimResultFile%
    
    ; removed bat from above command line to be able to use this code with csv as well as csv.gz files
    cL = csvtk uniq -f %dupCols%,%uniqueCol% "%inputFile%" | csvtk freq -f %dupCols% | csvtk filter2 -f "$frequency > 1" | csvtk cut -f -frequency -o %interimResultFile%
    RunWait, %comspec% /c %cL%, %workingDir%, Hide

    ; Step 2
    cL = csvtk join -f "%dupCols%" %interimResultFile% %inputFile% | csvtk uniq -f %dupCols%,%uniqueCol% | csvtk sort -k %dupCols%,%uniqueCol% -o %inputFileNoExt%_dupKeEx_result.csv.gz
    RunWait, %comspec% /c %cL%, %workingDir%, Hide

    ; Remove intermediate file
    IfExist, %interimResultFile%
        FileDelete, %interimResultFile%

    MsgBox, dupKeyEx operation complete

return
... ant this is the GUI from .ahk file compiled to exe:



So You can tailor this query to your need...
"Programming today is a race between software engineers striving to build bigger and better idiot-proof programs, and the universe trying to build bigger and better idiots. So far, the universe is winning." - Rick Cook.

Post Reply