Useful Shell Commands for Data Science
Foreword
Code snippets and excerpts from the DataCamp tutorial on shell commands, written for bash.
Quick notes
Use bash or an alternative such as zsh.
Examples are based on the adult dataset from the UCI Machine Learning Repository (Census Income dataset). This dataset is commonly used to predict whether income exceeds $50K/yr based on census data. It has 48,842 rows and 14 attributes.
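To follow along, the raw file can usually be downloaded with curl; the exact URL is an assumption here, since the UCI repository layout changes from time to time.
# fetch adult.data into the current directory (URL may have moved)
curl -O https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data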
Move to a directory with cd <dir> and print the current working directory with pwd. Move up one level with cd ..
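For example, assuming a folder named data exists under the current directory:
# move into the folder, print where we are, then move back up
cd data
pwd
cd ..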
Count with wc
# count lines
wc -l adult.data
# count words
wc -w adult.data
ls -l
ls -l folder
# count files
ls -l folder | wc -l
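Note that ls -l also prints a "total" line, so the count above is one more than the number of files. A small variant (not from the original tutorial) lists one entry per line and avoids that:
# count files, one name per line, no "total" header
ls -1 folder | wc -l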
# print the first lines (10 by default; change with -n)
head -n 2 adult.data
Concatenate with cat
- Print a file's content with cat adult.data.
- Concatenate files and create (or replace) a file with >; >> appends instead.
cat adult.data adult.data > target_file.csv
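A quick way to see the difference between > and >> (a minimal sketch, assuming target_file.csv was just created as above): appending grows the file instead of overwriting it.
# append a third copy instead of overwriting, then check the line count
cat adult.data >> target_file.csv
wc -l target_file.csv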
First, create a header file with the column names.
echo "age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,native-country,class" > header.csv
Then prepend the header, saving the result as adult.csv.
cat header.csv adult.data > adult.csv
Check the first and last rows (head and tail print 10 lines by default unless -n specifies otherwise).
head -n 1 adult.csv
tail -n 2 adult.csv
Modify with sed
Use sed when a file is corrupted or badly formatted, e.g. when it contains non-UTF-8 characters or misplaced commas. The general pattern is:
sed "s/<string to replace>/<string to replace it with>/g" <source_file> > <target_file>
Replace the ? used for missing values so that they end up as empty fields (read as NaN by most parsers).
First, count the instances.
grep ", ?," adult.csv | wc -l
Second, replace all the columns containing ? by an empty string:
- the text to match goes in the "s/<string to replace>/" part…
- …and the (empty) replacement in the "/<string to replace it with>/g" part.
Keep the column delimiter , on both sides so that only whole fields are matched.
sed "s/, ?,/,,/g" adult.csv > adult_v2.csv
Subset
For large files (30M rows or more), sample the head or the tail.
Extract the head (120 lines).
head -n 120 adult_v2.csv > adult_v3.csv
Extract the tail (12 lines).
tail -n 12 adult_v2.csv > adult_v4.csv
Extract 20 lines starting at line 100.
head -n 120 adult_v2.csv | tail -n 20 > adult_sample.csv
Add the header to the files that don't have one.
cat header.csv adult_v4.csv > adult_v4_with_header.csv
cat header.csv adult_sample.csv > adult_sample_with_header.csv
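A quick sanity check, reusing head from above, confirms that the header is now the first row.
head -n 1 adult_sample_with_header.csv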
Find duplicates with uniq
Find adjacent repeated lines in a file.
- uniq -c adds the repetition count to each line.
- uniq -d only outputs duplicate lines.
- uniq -u only outputs unique lines (see the sketch right after this list).
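For example, a minimal sketch counting the rows that appear exactly once (sorting first, because uniq only compares adjacent lines):
sort adult_v2.csv | uniq -u | wc -l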
First, sort the file to bunch duplicates together. Second, count the duplicates.
sort adult_v2.csv | uniq -d | wc -l
Third, sort the file again, count the occurrences of each line with uniq -c, sort the results in reverse order, and output the first 3.
sort adult_v2.csv | uniq -c | sort -r | head -n 3
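The reverse sort above compares the counts as text, which typically works because uniq -c pads them to a fixed width; sorting numerically with -n makes the intent explicit (a small variant, not from the original tutorial).
sort adult_v2.csv | uniq -c | sort -rn | head -n 3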
Select columns with cut
Select particular columns: -d specifies the column delimiter and -f specifies the fields (columns).
Find the number of unique values taken by the categorical variable workclass (the 2nd column of the file) and print the head of the results.
cut -d "," -f 2 adult_v2.csv | head -3
Repeat, but this time sort the results and count the occurrences of each value.
cut -d "," -f 2 adult_v2.csv | sort | uniq -c
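To get the number of distinct workclass values rather than the per-value counts, drop -c and count the remaining lines (a minimal sketch):
cut -d "," -f 2 adult_v2.csv | sort | uniq | wc -l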
Loop
Work on several files with loops, for example to replace one character with another in all the file names inside a directory. Here is how to declare and reference variables.
varname1=10
varname2='hello'
varname3=123.4
echo $varname1
echo $varname2
echo $varname3
First, declare two variables. Second, loop through the folder's .csv files with for. Third, build the new name by replacing the character. Finally, rename each file with mv.
replace_source=' '
replace_target='_'
# rename every .csv file, replacing spaces with underscores in the name
for filename in ./*.csv; do
    new_filename=${filename//$replace_source/$replace_target}
    mv "$filename" "$new_filename"
done
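Before renaming for real, a cautious variant (not part of the original tutorial) is a dry run that only prints the mv commands:
# dry run: print the rename commands instead of executing them
for filename in ./*.csv; do
    new_filename=${filename//$replace_source/$replace_target}
    echo mv "$filename" "$new_filename"
done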
The same can be done with a while loop; however, for loops are faster.
replace_source=' '
replace_target='_'
# same rename, reading the file names line by line with a while loop
ls ./*.csv | while read -r filename; do
    new_filename=${filename//$replace_source/$replace_target}
    mv "$filename" "$new_filename"
done