介绍

Missing Semester 是 MIT 开的一门关于高效使用计算机工具的课程,以下是课程介绍

在传统的计算机科学教育中,你有可能会选修大量的课程来教授你CS中的高级主题,从操作系统到编程语言再到机器学习。 但在许多院校中,有一个重要的主题很少被提及,而是让学生自己去学习:计算生态系统知识。

多年来,我们在麻省理工学院帮助教过几门课,一次又一次地看到,很多学生对可用的工具知识点都很有限。 计算机是为自动化手工任务而生的,然而学生们经常用手来完成重复性的任务,或者未能充分利用版本控制和文本编辑等强大的工具。 在最好的情况下,这导致了效率低下和时间的浪费;在最坏的情况下,会导致数据丢失或无法完成某些任务等问题。

这些主题没有作为大学课程的一部分来教授:学生们从来没有被告知如何使用这些工具,或者至少没有被告知如何有效地使用这些工具, 从而将时间和精力浪费在本应简单的任务上。标准的CS课程缺失了关于计算生态系统的关键主题,而这些主题可以让学生的生活变得更加轻松。

为了帮助弥补这个问题,我们正在开设一个课程,涵盖了我们认为成为一名有效的计算机科学家和程序员的关键话题。 这门课是实用性和实践性的,它提供了实践性的工具和技术介绍,你可以立即应用到你会遇到的各种情况。 该课程在2020年1月MIT的 "独立活动期 "期间开设---为期一个月的学期,以学生开设的较短的课程为特色。 虽然讲座本身只对麻省理工学院的学生开放,但我们将向公众提供所有的讲座资料以及讲座的录像。

相见恨晚的感觉,MIT 果然是顶尖名校。实际上这些东西不管是对日后工作还是科研都有巨大的帮助和效率提升,对熟练者来说 这些内容都不值一提,但是对于尤其是刚接触计算机的学生来说,这些东西都非常有价值,中间有比较大的鸿沟需要弥合, 再次感谢和致敬 MIT!他们做了很重要的工作。

课程总共分为几个部分,每个部分都可以看作是独立的讲座,每个部分后面都有对应的联系,下面我写的一些练习解答,工作环境为 macOS Catlina 10.15.5 Beta。

✅ Bash Shell

这一节的重点在理解 *nix 操作系统中关于 的概念,每个 *nix 工具都有输入和输出流,而且输入和输出均为字符串, 而非结构化数据,这一点其实我一直觉得虽然看上去万物归一只需要组合使用,但无形之中对初学者增加了非常高的使用门槛,学习 这些东西并不是因为它设计得有多好,而是在大多数的情况下,这些已经几乎成了默认配置。为了实现目标,只能通过记忆或 查看手册来强行使用,我对 vim 和正则表达式的看法也是一致的。这些在计算机和人类之间的东西,目的是提供人和计算机沟通的 桥梁,如果把它们当作是编程语言来看待,无疑是比较糟糕的。一些人因为掌握了这些东西而觉得高人一等,我觉得很奇怪。

奇怪归奇怪,现实归现实。第一节的练习主要是创建一个内含有命令的文件,调用文件来得到其中的某个字符串并输出到另外一个文件中。

$ echo '#!/bin/sh' > semester

shebang 写入文件,注意 单引号

$ echo 'curl --head --silent https://missing.csail.mit.edu' >> semester

将这句 curl 命令缀入文件。

$ sh semester

执行文件。或者

$ chmod +x semester
$ ./semester

这里首先给这个文件加入可执行权限然后执行。会打印出 http 头部信息。

HTTP/2 200
content-type: text/html; charset=utf-8
server: GitHub.com
x-origin-cache: 1
last-modified: Fri, 08 May 2020 14:56:19 GMT
etag: "5eb57313-1abd"
access-control-allow-origin: *
expires: Sun, 10 May 2020 18:48:31 GMT
cache-control: max-age=600
x-proxy-cache: MISS
x-github-request-id: 806A:29EE:42CEC3:4809A1:5EB84A26
accept-ranges: bytes
date: Mon, 11 May 2020 10:10:46 GMT
via: 1.1 varnish
age: 31
x-served-by: cache-tyo19925-TYO
x-cache: HIT
x-cache-hits: 1
x-timer: S1589191847.514805,VS0,VE0
vary: Accept-Encoding
x-fastly-request-id: 4d434e7eaa01cda25ac4d549ca681645bdb43202
content-length: 6845

提取出想要的信息

$ ./semester | grep -i last-modified | cut -d ':' -f 2-

这里使用了 grepcut 两个工具,具体使用如果不熟悉参数的话可能还需要 man 一下。而且 Linux 和 macOS 的参数可能也有些许不同,如果出现错误,查看文档。

最终把输出重定向到新文件中

$ ./semester | grep -i last-modified | cut -d ' ' -f 2- > ~/last-modified.txt

✅ 文本编辑器:vim

i insert a append o open new line O open new line reverce(up)

w word e end of word $ end of line 0 start of line

y yank yd yw y$ yG ygg p past

ctrl+I ctrl+O

gg G 456G

d dd d$ d0 dw de diw dG dgg

u undo U undo ctrl+R redo

x

:w write/save :w TEST save current file as TEST

:r read from FILE or external command

! external command :r TEST :r !ls :r !ls -al | grep TEST

v visual mode v (hjkl)

v (hjkl) :w FILENAME save visual selected content to FILENAME

:set ic :set noic :set linenum :set hls is :set nocp

ctrl+D help TAB

ctrl+W ctrl+W

(Advanced) Convert XML to JSON (example file) using Vim macros. Try to do this on your own, but you can look at the macros section above if you get stuck.

✅ 数据处理:Data Wrangling

ssh myserver journalctl
    | grep sshd
    | grep "Disconnected from"
    | sed 's/.*Disconnected from //'
ssh myserver journalctl
    | grep sshd
    | grep "Disconnected from"
    | sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/'
ssh myserver journalctl
    | grep sshd
    | grep "Disconnected from"
    | sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/'
    | sort | uniq -c
ssh myserver journalctl
    | grep sshd
    | grep "Disconnected from"
    | sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/'
    | sort | uniq -c
    | sort -nk1,1 | tail -n10
ssh myserver journalctl
    | grep sshd
    | grep "Disconnected from"
    | sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/'
    | sort | uniq -c
    | sort -nk1,1 | tail -n10
    | awk '{print $2}' | paste -sd,
# awk
BEGIN { rows = 0 }
$1 == 1 && $2 ~ /^c[^ ]*e$/ { rows += $1 }
END { print rows }
ssh myserver journalctl
    | grep sshd
    | grep "Disconnected from"
    | sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/'
    | sort | uniq -c
    | awk '{print $1}' | R --slave -e 'x <- scan(file="stdin", quiet=TRUE); summary(x)'
ssh myserver journalctl
    | grep sshd
    | grep "Disconnected from"
    | sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/'
    | sort | uniq -c
    | sort -nk1,1 | tail -n10
    | gnuplot -p -e 'set boxwidth 0.5; plot "-" using 1:xtic(2) with boxes'
ffmpeg -loglevel panic -i /dev/video0 -frames 1 -f image2 -
    | convert - -colorspace gray -
    | gzip
    | ssh mymachine 'gzip -d | tee copy.jpg | env DISPLAY=:0 feh -'
Find the number of words (in /usr/share/dict/words) that contain at least three as and don’t have a 's ending. What are the three most common last two letters of those words? sed’s y command, or the tr program, may help you with case insensitivity. How many of those two-letter combinations are there? And for a challenge: which combinations do not occur?

✅ 版本控制:git

  • Snapshot
  • Modeling history: relating snapshot directed acyclic graph (DAG) 有向无环图
  • Data model
// a file is a bunch of bytes
type blob = array<byte>

// a directory contains named files and directories
type tree = map<string, tree | file>

// a commit has parents, metadata, and the top-level tree
type commit = struct {
    parent: array<commit>
    author: string
    message: string
    snapshot: tree
}
  • Objects and content-addressing
type object = blob | tree | commit
objects = map<string, object>

def store(object):
    id = sha1(object)
    objects[id] = object

def load(id):
    return objects[id]
git cat-file -p xxxxxxxx
  • References
references = map<string, string>

def update_reference(name, id):
    references[name] = id

def read_reference(name):
    return references[name]

def load_reference(name_or_id):
    if name_or_id in references:
        return load(references[name_or_id])
    else:
        return load(name_or_id)
  • Repositories
  • Staging area

✅ 关于编程的一些元知识

  • build system 构建系统: make
paper.pdf: paper.tex plot-data.png
    pdflatex paper.tex

plot-%.png: %.dat plot.py
    ./plot.py -i $*.dat -o $@

.PHONY: clean
clean:
    rm *.png *.pdf
git lis-files
.git/hooks
pre-commit hook # refuses the commit if the make command fails

shellcheck
proselint
write-good
  • dependency management
  • semantic versioning 语义版本
  • lock file 锁定版本: vendoring
  • continuous integration 持续集成: dependbot, code coverage, github pages
  • testing 测试: test suit, unit test, integrate test, regression test, mocking

✅ Security and Cryptography 安全和加密的应用

Entropy 熵

Entropy is a measure of randomness. This is useful, for example, when determining the strength of a password. 熵是衡量随机性的一个指标。在确定密码的强度时这一点很有用。

用熵来衡量密码强度的时候需要考虑攻击者可能知道你的密码 模式 ,但是不知道密码的 随机性,模式就是你密码的生成规则。

这就是为什么漫画中的第一种密码比第二种密码熵低(因为都假设攻击者知道模式,而第一种密码的模式更详细),不过实际真实情况可能更复杂。 这里的模式其实就是 generation algorithm ,密码的生成算法。

Suppose a password is chosen as a concatenation of five lower-case dictionary words, where each word is selected uniformly at random from a dictionary of size 100,000. An example of such a password is correcthorsebatterystaple. How many bits of entropy does this have?

Solution:

>>> import math
>>> 5*math.log(100000, 2)
83.04820237218406
Consider an alternative scheme where a password is chosen as a sequence of 8 random alphanumeric characters (including both lower-case and upper-case letters). An example is rg8Ql34g. How many bits of entropy does this have?

Solution:

>>> 8*math.log(26+26+10, 2)
47.633570483095006

The first one is stronger.

Suppose an attacker can try guessing 10,000 passwords per second. On average, how long will it take to break each of the passwords?

Solution:

>>> pow(2, 83)/10000/3600/24/365
30667829011025.605
>>> pow(2, 47)/10000/3600/24/365
446.27564800649424

Hash functions

A cryptographic hash function maps data of arbitrary size to a fixed size, and has some special properties. 加密哈希函数将任意大小的数据映射到一个固定的大小,并具有一些特殊的属性。

hash(value: array<byte>) -> vector<byte, N>  (for some fixed N)
printf 'hello' | sha1sum # sha1
  • Deterministic
  • Non-invertible
  • Target collision resistant
  • Collision resistant

Key Derivation Function KDF

A related concept to cryptographic hashes, key derivation functions (KDFs) are used for a number of applications, including producing fixed-length output for use as keys in other cryptographic algorithms. Usually, KDFs are deliberately slow, in order to slow down offline brute-force attacks.

密钥派生函数(KDFs)是一个与加密散列相关的概念,它被用于一系列的应用,包括产生固定长度的输出,用于其他加密算法中的密钥。 通常情况下,KDFs是刻意的慢,以减缓离线蛮力攻击。

Symmetric cryptography

Hiding message contents is probably the first concept you think about when you think about cryptography. Symmetric cryptography accomplishes this with the following set of functionality:

隐藏消息内容可能是你想到密码学时第一个想到的概念。对称密码学通过以下一组功能实现了这一点。

keygen() -> key  (this function is randomized)

encrypt(plaintext: array<byte>, key) -> array<byte>  (the ciphertext)
decrypt(ciphertext: array<byte>, key) -> array<byte>  (the plaintext)

decrypt(encrypt(m, k), k) = m // aes256

passphrase -> KDF -> key->
                    plaintext -> cipheretext

for example,

openssl aes-256-cbc -alt -in file.txt -out file.enc.txt

Asymmetric cryptography

The term “asymmetric” refers to there being two keys, with two different roles. A private key, as its name implies, is meant to be kept private, while the public key can be publicly shared and it won’t affect security (unlike sharing the key in a symmetric cryptosystem). Asymmetric cryptosystems provide the following set of functionality, to encrypt/decrypt and to sign/verify:

所谓 "非对称 "是指有两个密钥,有两个不同的作用。私钥,顾名思义,是为了保持私密性,而公钥可以公开共享,不会影响安全性(与对称密码系统中的共享密钥不同)。 非对称加密系统提供了以下一组功能,用于加密/解密和签名/验证。

keygen() -> (public key, private key)  (this function is randomized)

encrypt(plaintext: array<byte>, public key) -> array<byte>  (the ciphertext)
decrypt(ciphertext: array<byte>, private key) -> array<byte>  (the plaintext)

sign(message: array<byte>, private key) -> array<byte>  (the signature)
verify(message: array<byte>, signature: array<byte>, public key) -> bool  (whether or not the signature is valid)

decrypt(encrypt(m, public key), private key) = m

verify(message, signature, public key)

verify(message, sign(message, private key), public key) = true

PGP signature encrypt email, IM like signal, telegram, deb software siging,

git tag
git cat-file -p v1.3.0
git tag -v v1.3.0

RSA, web of trust, chain of trust, keybase.io bootstrap

Hybrid cryptography

Always combined together in real world.

Debugging and Profiling

debugging

  • log level journactl systemctl
  • debugger ipdb gdb
  • strace
  • static analysis tools pyflakes mypy writegood

profiling

  • timing real time, user time, system time
  • prifiler CPU prifiler tracing profiler, sample profiler

cProfiles kernprof ling profiling

memory_profiler

sudo perf stat stress -c 1 sudo perf record stress -c 1

flame graph call graph

htop

stress -c 4 du -h /vedios

ncdu

lsof

hyperfine --warmup 3 'fd -e jpg' 'find . -name "*.jpg"'

命令行环境 Command-line Environment

Job Control: signal

nohup

hup kill stop

Terminal Mutiplexer

tmux

ctrl+b change key bind to ctrl+a

Sessions -> windows -> panes

htop

Dotfiles

alias
~/.bashrc
~/.vimrc

symlink

Remote Machine

ssh root@1.2.3.4 ls -al | grep primes
scp
rsync
ssh config

Potpourri 大杂烩

  • keyboard remapping
  • deamons
  • file systems in user space SSHFS
  • backup
  • APIs curl jq authenticate tokens IFTTT --
  • Window manager
  • VPNs
  • markdown
  • Hammerspoon
  • Booting + Live USBs
  • Docker, Vagrant, VMs, Cloud, OpenStack
  • Notebook programming Jupyter, Wolfram Mathematica
  • GitHub

eBPF column -t pandoc R ggplot2

vim D-string

https://missing.csail.mit.edu/