INFO 212 数据科学编程I笔记

发表于 2022-02-25 分类于笔记

220225

数据

数据类型：结构化/非结构化、离散/连续、枚举/二元/有序/数值（绝对/相对/区间/比例）

数据科学

数据科学生命周期

数据科学编程

功能编程 Functional -> 科学编程 Data Science

不刻意注重实现细节 -> 利用工具实现(I)/算法优化(II)

可用语言

Python、R、SQL、Matlab、Scala

R：统计、可视化

Scala：实际应用，较新。

实验部分

scikit-learn

目标：在电脑上运行

前置

安装：

Ref.

实验1

Run k-means clustering on your machine

k-means++

实验2

Run OPTICS clustering on your machine

OPTICS

使用jupyter notebook可以运行

补充：

使用虚拟环境virtual env

pip install virtualenv # 安装

virtualenv [venvDir] # 创建

virtualenv -p [dir of python2/3] [venv dir] # 指定python版本

source [venv dir]/bin/activate # 运行虚拟环境

deactivate # 终止虚拟环境

M1使用虚拟环境可能存在问题，使用jupyter notebook即可。

20220304 (lab 2)

~~DDL: 2022-03-11 18:10~~

基本算是对.csv处理的大模拟。处理难度不算高，就是需要手写统计上的一些计算，一层循环也还好。感觉就是考察一下码力和对module的熟悉程度（~~完全不熟,全靠现学~~），顺手用了str.split()、namedtuple、datetime之类的，总共三小时多大概。

20220318 (week 4)

lab session: 统计有哪些IP访问过nginx？哪个IP访问最多，多少次？

file (access.csv)

过程：

用cat access.csv | awk -F ' ' '{ print $1 }' > ips.txt将所有的IP信息导出到ips.txt。因为没有限制方法，所以对ips.txt采用python进行处理。

f = open("ips.txt")

d = dict()

for line in f.readlines():
	d.setdefault(line,0)
	d[line] += 1

l = list(d.items())
l.sort(key=lambda x:x[1],reverse=True)
print('IPs:\n', '\n'.join(map(str,l)))
print('unique IPs:\n', ''.join(map(lambda x:str(x[0]),l)))

IPs:
('110.42.250.65\n', 3629)
('175.178.89.241\n', 33)
('27.184.88.168\n', 29)
('180.109.90.227\n', 26)
('1.202.162.44\n', 16)
('115.155.91.183\n', 6)
('45.146.165.37\n', 5)
('202.201.3.82\n', 5)
('27.227.141.127\n', 5)
('210.26.56.179\n', 5)
('42.94.64.38\n', 4)
('5.188.210.227\n', 3)
('115.155.65.83\n', 3)
('121.46.142.244\n', 2)
('103.162.30.101\n', 2)
('83.97.20.34\n', 2)
('5.181.80.168\n', 2)
('194.31.98.117\n', 2)
('157.245.70.127\n', 2)
('175.100.20.197\n', 2)
('172.21.203.16\n', 2)
('115.155.91.38\n', 2)
('210.26.112.99\n', 2)
('117.136.27.184\n', 2)
('115.155.111.47\n', 2)
('27.115.124.38\n', 2)
('115.155.115.228\n', 2)
('129.151.35.240\n', 1)
('45.85.190.62\n', 1)
('162.221.192.26\n', 1)
('23.95.100.141\n', 1)
('2.56.57.187\n', 1)
('35.205.189.107\n', 1)
('20.110.176.213\n', 1)
('192.241.224.147\n', 1)
('190.212.140.11\n', 1)
('40.77.167.97\n', 1)
('207.46.13.22\n', 1)
('103.161.17.249\n', 1)
('185.173.35.5\n', 1)
('159.223.224.155\n', 1)
('130.211.54.158\n', 1)
('220.181.108.173\n', 1)
('172.104.242.173\n', 1)
('139.162.200.4\n', 1)
('37.57.137.208\n', 1)
('123.160.233.170\n', 1)
('121.29.188.91\n', 1)
('59.52.102.30\n', 1)
('220.250.62.155\n', 1)
('140.224.64.164\n', 1)
('221.213.75.6\n', 1)
('111.85.200.151\n', 1)
('123.145.3.219\n', 1)
('111.85.200.199\n', 1)
('36.5.220.240\n', 1)
('115.148.153.229\n', 1)
('223.104.57.34\n', 1)
('81.69.57.239\n', 1)
('192.241.223.176\n', 1)
('64.225.3.187\n', 1)
('193.118.53.202\n', 1)
('219.240.249.185\n', 1)

由输出前一部分（如上）得到排序后的IP出现次数。最多的应该是110.42.250.65，访问3629次。

由后一部分（如下）得到所有IP。

unique IPs:
110.42.250.65
175.178.89.241
27.184.88.168
180.109.90.227
1.202.162.44
115.155.91.183
45.146.165.37
202.201.3.82
27.227.141.127
210.26.56.179
42.94.64.38
5.188.210.227
115.155.65.83
121.46.142.244
103.162.30.101
83.97.20.34
5.181.80.168
194.31.98.117
157.245.70.127
175.100.20.197
172.21.203.16
115.155.91.38
210.26.112.99
117.136.27.184
115.155.111.47
27.115.124.38
115.155.115.228
129.151.35.240
45.85.190.62
162.221.192.26
23.95.100.141
2.56.57.187
35.205.189.107
20.110.176.213
192.241.224.147
190.212.140.11
40.77.167.97
207.46.13.22
103.161.17.249
185.173.35.5
159.223.224.155
130.211.54.158
220.181.108.173
172.104.242.173
139.162.200.4
37.57.137.208
123.160.233.170
121.29.188.91
59.52.102.30
220.250.62.155
140.224.64.164
221.213.75.6
111.85.200.151
123.145.3.219
111.85.200.199
36.5.220.240
115.148.153.229
223.104.57.34
81.69.57.239
192.241.223.176
64.225.3.187
193.118.53.202
219.240.249.185

final solution

cat access.csv | awk -F ' ' '{ print $1 }' | sort | uniq -c

output:

  16 1.202.162.44
   1 103.161.17.249
   2 103.162.30.101
3629 110.42.250.65
   1 111.85.200.151
   1 111.85.200.199
   1 115.148.153.229
   2 115.155.111.47
   2 115.155.115.228
   3 115.155.65.83
   6 115.155.91.183
   2 115.155.91.38
   2 117.136.27.184
   1 121.29.188.91
   2 121.46.142.244
   1 123.145.3.219
   1 123.160.233.170
   1 129.151.35.240
   1 130.211.54.158
   1 139.162.200.4
   1 140.224.64.164
   2 157.245.70.127
   1 159.223.224.155
   1 162.221.192.26
   1 172.104.242.173
   2 172.21.203.16
   2 175.100.20.197
  33 175.178.89.241
  26 180.109.90.227
   1 185.173.35.5
   1 190.212.140.11
   1 192.241.223.176
   1 192.241.224.147
   1 193.118.53.202
   2 194.31.98.117
   1 2.56.57.187
   1 20.110.176.213
   5 202.201.3.82
   1 207.46.13.22
   2 210.26.112.99
   5 210.26.56.179
   1 219.240.249.185
   1 220.181.108.173
   1 220.250.62.155
   1 221.213.75.6
   1 223.104.57.34
   1 23.95.100.141
   2 27.115.124.38
  29 27.184.88.168
   5 27.227.141.127
   1 35.205.189.107
   1 36.5.220.240
   1 37.57.137.208
   1 40.77.167.97
   4 42.94.64.38
   5 45.146.165.37
   1 45.85.190.62
   2 5.181.80.168
   3 5.188.210.227
   1 59.52.102.30
   1 64.225.3.187
   1 81.69.57.239
   2 83.97.20.34

20220325 (week 5 & lab 5)

lab5

access.csv

目标：

统计IP信息
描述那些工具可用于数据科学？
用法举例

注释：

log格式如下

1
2
3

log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                  '$status $body_bytes_sent "$http_referer" '
                  '"$http_user_agent" "$http_x_forwarded_for"';

过程

IP部分

可以参考前述内容，对于格式化的回答可以采用awk解决，具体如下：

#!/bin/bash

# detailed user agent(UA) format: $remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" "$http_x_forwarded_for"

# counting IP

# split each piece of record using character ' ', then $1 is $remote_addr

cat access.csv | awk -F ' ' '{ print $1 }' | sort | uniq -c | awk -F ' ' '
    BEGIN { 
        print "Counting IP";
        print "The IPs that visited the nginx server:"
        maxOccur = 0; 
        maxAddr = ""; 
    } 
    { 
        if ( $1 > maxOccur) { 
            maxOccur = $1; 
            maxAddr = $2; 
        } 
        print $2; 
    } 
    END { 
        printf "The IP that visited the most times is %s, %s times in total.\n\n", maxAddr, maxOccur 
    }'

其中，awk的教程和语法可以参考以下内容：

浏览器信息统计部分

可参考信息：

可用工具有：

但需要注意的是Edge浏览器，经过实测以及access.csv中提供的UA信息均可得，Edge的UA信息末尾显示为Edg/version，并非Edge/version。

此处打算采用乱搞的特判（结合数据），大概能判（大概）(确实可以)。

# counting explorers

# split each piece of record using character '"', then $6 is $http_user_agent

# More information: https://en.wikipedia.org/wiki/User_agent
# User agent tool: https://useragent.buyaocha.com
# It should be declared that Edge has a UA ends with `Edg/version`, instead of `Edge/version`, this conclusion can be obtained by `access.csv` and test using Edge.
# Here, the method of classifying the explorers using the special matching, like unique identifier, although it may not always be accurate.

cat access.csv | awk -F '"' '{ print $6 }' | awk '{
    if ( $0 ~ /Edg/ ) {
        print "#Microsoft Edge"; # Here the character `#` denotes the separator between the count and the name. Otherwize, the processing of split may be difficult in next steps. 
    }
    else if ( $0 ~ /Firefox/ ) { 
        print "#Firefox";
    }
    else if ( $0 ~ /Chromium/ ) {
        print "#Chromium";
    }
    else if ( $0 ~ /Chrome/ || $0 ~ /CriOS/ ) {
        print "#Chrome";
    }
    else if ( $0 ~ /Safari/ ) {
        print "#Safari";
    }
    else if ( $0 ~ /Android/ && $0 ~ /WebKit/ ) {
        print "#WebKit";
    }
    else if ( $0 ~ /MSIE/ || $0 ~ /Trident/ ) {
        print "#MSIE";
    }
    else {
        printf "#%s\n", $0;
    }
}' | sort | uniq -c | awk -F '#' '
    BEGIN { 
        totCnt = 0;
        print "Counting explorers"
        print "Percent/%  Explorer Name";
    }
    {
        cnt[NR] = $1;
        totCnt += cnt[NR];
        name[NR] = $2;
    }
    END {
        for (i = 1; i <= NR; ++i) {
            printf "%8.3f%%  %-20s\n", 100 * cnt[i] / totCnt, name[i]
        }
        print "";
    }'

需要注意的是还有MSIE和Chromium，以及iOS上的Chrome，此处合并判断了。

awk的注释是#到行末，变量和数组不需要声明，可以直接使用。正则表达式用~。

可以参考的内容有：

系统信息统计部分

从上一部分的参考资料 (User agent) 里面能发现，系统信息是第一个括号内的内容。

此处可以提取第一个括号，然后统计。

# counting systems information

cat access.csv | awk -F '"' '{ print $6 }' | awk '{ 
    st = index($0, "(");
    en = index($0, ")");
    # Get the information from the first pair of bracket
    if (st && en) {
        info = substr($0, st+1, en-st-1);
    }
    else{
        info = $0;
    }
    # print info;
    if ( info ~ /iPhone/ ) {
        print "#iOS"; # Here the character `#` denotes the separator between the count and the name. Otherwize, the processing of split may be difficult in next steps. 
    }
    else if ( info ~ /iPad/ ) {
        print "#iPadOS";
    }
    else if ( info ~ /Mac OS/ ) {
        print "#macOS";
    }
    else if ( info ~ /Windows NT/ ) {
        print "#Windows";
    }
    else if ( info ~ /Android/ ) {
        print "#Android";
    }
    else if ( info ~ /Linux/ ) {
        print "#Linux";
    }
    else if ( info ~ /OpenBSD/ ) {
        print "#openBSD";
    }
    else {
        printf "#%s\n", info;
    }
 }' | sort | uniq -c | awk -F '#' '
    BEGIN { 
        totCnt = 0;
        print "Counting systems"
        print "Percent/%  System Name";
    }
    {
        cnt[NR] = $1;
        totCnt += cnt[NR];
        name[NR] = $2;
    }
    END {
        for (i = 1; i <= NR; ++i) {
            printf "%8.3f%%  %-20s\n", 100 * cnt[i] / totCnt, name[i]
        }
        print "";
    }'

OpenBSD也是操作系统（类Unix），iPhone和iPad在数据中没有出现，但根据已有信息可知，其判断需要在macOS之前。

至此，三部分拼接起来就得到了lab5.sh。第二问和第三问中按实际情况，没有提取出常见系统信息的记录保留了原信息。

20220328 Week 6 (T1)

numpy (Run in iPython/Jupyter)

import numpy as np

np.read_csv('file') # read csv file
np # show the table
np.drop(['column_names'],axis=1) # drop columns

处理表格、多维数组、图表

一些工具: pandas, numpy, matplotlib, scikit learn, statsmodels, SciPy

一些语言: R, SPSS, Matlab, GNU Octave, Fortran(1950s, very old, very efficient), Julia(2012, very efficient)

main tools: python3, ipython, jupyter notebooks, git (for version control)

Syllabus:

Python basics, ipython, jupyter notebooks
built-in data structures, functions files
numpy basics
pandas basics
matplotlib basics
file manipulation, git (and files formats)
data cleaning and preparation
data wrangling, joining, combining, reshaping
data aggregation and grouping

datasets source: kaggle

Assignments:

Evaluation

Three assignments:

Assignment 1 (GreenHub dataset, NumPy) - 15%
Assignment 2 (League of Legends, Pandas and matplotlib) - 25%
Assignment 3 (GreenHub again, all the covered materials) - 40%

One test at the end. - 20%

20220330 Week 6 (T2)

python basis recap

20220408 Week 7 (T3)

numpy

array: broadcasting

ex1: Build a function that receives two arrays as
arguments and, solely based on their shape,
outputs which one of them has more elements.
These arrays do not need to have the same
number of dimensions.

# from functools import reduce

def compareSize(a,b):
    # sizeA = reduce(lambda x,y:x*y, a.shape)
    sizeA = np.array(a.shape).prod()
    # sizeB = reduce(lambda x,y:x*y, b.shape)
    sizeB = np.array(b.shape).prod()
    if sizeA > sizeB:
        return a
    else:
        return b

ex2: Write a function that takes as input a 2- dimensional NumPy array and transposes it without using the T attribute nor the transpose() function.

def transpose(a):
    shapes = a.shape
    b = np.zeros((shapes[1],shapes[0]))
    for i in range(shapes[0]):
        for j in range(shapes[1]):
            b[j,i] = a[i,j]
    return b