Huahin Tools
Huahin Tools is a set of tools that run on Hadoop MapReduce.
Huahin Tools is distributed under the Apache License 2.0.
-----------------------------------------------------------------------------
Documentation
http://huahinframework.org/huahin-tools/
-----------------------------------------------------------------------------
Requirements
* Java 6+
-----------------------------------------------------------------------------
Install Huahin Tools
~ $ tar xzf huahin-tools-x.x.x.tar.gz
-----------------------------------------------------------------------------
Amazon Elastic MapReduce
Upload the jar file to S3: huahin-tools.jar
Example jar path setting:
s3://huahin/tools/huahin-tools.jar
-----------------------------------------------------------------------------
Run Huahin Tools
To run Huahin Tools, use the bin/huahin-tools script. For example:
$ bin/huahin-tools -j formatting -i ../input/ -o output
-----------------------------------------------------------------------------
Common Arguments
-l local mode. The map phase runs with MultithreadedMapper.
-t number of threads used by MultithreadedMapper.
-s split size.
-j job name. Selects the tool to run (formatting, wc, cut, ccext, urldec, urlsw).
-----------------------------------------------------------------------------
Tools
* Formatting
Formatting splits each line of the specified input with a regular expression and formats the result.
If you do not specify a regular expression, the default Apache log format is used.
arguments
required
-i data input path.
-o data output path.
option
-p separator pattern, given as a regular expression. Default is the default Apache log format.
-e output group numbers. Specify which groups of the regular expression to output.
-n number of groups in the specified regular expression.
For example:
$ ./bin/huahin-tools -l -t 4 -s 2147483648 \
-j formatting \
-i /tmp/input/ \
-o /tmp/output/ \
-e 0,3,5 \
-n 11
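As a rough illustration of what selecting regular expression groups means, here is a minimal Python sketch. It is not the tool's implementation. The pattern is a simplified Apache common log pattern, format_line is a hypothetical helper, and both the 0-based group numbering and the TAB-joined output are assumptions made for the example.

```python
import re

# Simplified Apache common log pattern (an assumption, not the tool's default).
APACHE_COMMON = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)$'
)


def format_line(line, pattern, groups):
    """Return the selected captured groups joined by TAB, or None if no match.

    groups are treated as 0-based indexes into the captured groups,
    which is an assumption about how -e is numbered.
    """
    match = pattern.match(line)
    if match is None:
        return None
    captured = match.groups()
    return "\t".join(captured[g] for g in groups)


LINE = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /apache_pb.gif HTTP/1.0" 200 2326')
print(format_line(LINE, APACHE_COMMON, [0, 3, 5]))
```

With groups 0, 3 and 5 the sketch keeps the client address, the timestamp and the status code.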
* wc
wc is the equivalent of the Linux wc command (-l option only).
arguments
required
-i data input path.
-o data output path.
For example:
$ ./bin/huahin-tools -l -t 4 -s 2147483648 \
-j wc \
-i /tmp/input/ \
-o /tmp/output/
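The line counting that wc -l performs can be pictured as a per-chunk count followed by a sum. The Python sketch below shows that idea under the assumption that the input is handed over in independent chunks. count_lines and total_lines are hypothetical names, and nothing here is taken from the tool's source.

```python
def count_lines(chunk):
    """Count newline characters in one chunk of text, as wc -l does."""
    return chunk.count("\n")


def total_lines(chunks):
    """Sum the per-chunk counts to get the total for the whole input."""
    return sum(count_lines(chunk) for chunk in chunks)


print(total_lines(["a\nb\n", "c\n", ""]))
```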
* cut
cut is the equivalent of the Linux cut command.
arguments
required
-i data input path.
-o data output path.
option
-f columns to select, e.g. 1 or 1,2 or 1-4.
-d delimiter. Default TAB.
For example:
$ ./bin/huahin-tools -l -t 4 -s 2147483648 \
-j cut \
-i /tmp/input/ \
-o /tmp/output/ \
-f 3,4
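The column specification forms accepted by -f (1, 1,2, 1-4) can be expanded as in the following Python sketch. It is only an illustration of the notation. parse_fields and cut_line are hypothetical helpers, and emitting the columns in the order they are listed is an assumption (Linux cut emits them in input order).

```python
def parse_fields(spec):
    """Expand a spec such as "1", "1,2" or "1-4" into 1-based column numbers."""
    columns = []
    for part in spec.split(","):
        if "-" in part:
            start, end = part.split("-", 1)
            columns.extend(range(int(start), int(end) + 1))
        else:
            columns.append(int(part))
    return columns


def cut_line(line, spec, delimiter="\t"):
    """Keep only the selected columns of one line; missing columns are skipped."""
    cells = line.split(delimiter)
    picked = [cells[c - 1] for c in parse_fields(spec) if 1 <= c <= len(cells)]
    return delimiter.join(picked)


print(parse_fields("1-4"))
print(cut_line("a\tb\tc\td\te", "3,4"))
```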
* ccext
ccext extracts the rows that have the specified number of columns.
arguments
required
-i data input path.
-o data output path.
option
-n number of columns a row must have.
-v invert the -n match.
-d delimiter. Default TAB.
For example:
$ ./bin/huahin-tools -l -t 4 -s 2147483648 \
-j ccext \
-i /tmp/input/ \
-o /tmp/output/ \
-n 12 \
-v
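Reading ccext as a filter on the number of columns per row (an interpretation of the terse description, not something confirmed from the source code), the behaviour of -n and -v might look like this Python sketch. The function name and signature are hypothetical.

```python
def ccext(lines, n, invert=False, delimiter="\t"):
    """Keep rows whose column count equals n, or differs from n when invert is set."""
    kept = []
    for line in lines:
        matches = len(line.split(delimiter)) == n
        if matches != invert:
            kept.append(line)
    return kept


ROWS = ["a\tb\tc", "a\tb", "x\ty\tz"]
print(ccext(ROWS, 3))
print(ccext(ROWS, 3, invert=True))
```

Under this reading, the example above (-n 12 -v) would keep the rows that do not have exactly 12 columns, which can help to find malformed records.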
* urldec
urldec URL-decodes the specified columns.
arguments
required
-i data input path.
-o data output path.
option
-f columns to decode, e.g. 1 or 1,2 or 1-4.
-d delimiter. Default TAB.
For example:
$ ./bin/huahin-tools -l -t 4 -s 2147483648 \
-j urldec \
-i /tmp/input/ \
-o /tmp/output/ \
-f 3,5
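To illustrate decoding only selected columns, here is a small Python sketch using the standard library. urldecode_columns is a hypothetical helper. Treating "+" as a space (unquote_plus) and decoding as UTF-8 are assumptions about the tool's behaviour.

```python
from urllib.parse import unquote_plus


def urldecode_columns(line, columns, delimiter="\t"):
    """URL-decode the given 1-based columns and leave the others untouched."""
    cells = line.split(delimiter)
    for c in columns:
        if 1 <= c <= len(cells):
            cells[c - 1] = unquote_plus(cells[c - 1])
    return delimiter.join(cells)


print(urldecode_columns("1\t%41\thadoop%20mapreduce\ty\ta%2Fb", [3, 5]))
```

Column 2 keeps its encoded value (%41) because only columns 3 and 5 were requested.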
* urlsw
urlsw extracts the search keyword from URLs in the specified columns.
arguments
required
-i data input path.
-o data output path.
option
-f columns that contain the URLs, e.g. 1 or 1,2 or 1-4.
-m master file path. huahin-tools comes with master/SearchEngine.tsv.
The format is TSV (search engine host, search engine path, search engine query name).
-d delimiter. Default TAB.
For example:
$ ./bin/huahin-tools -l -t 4 -s 2147483648 \
-j urlsw \
-i /tmp/input/ \
-o /tmp/output/ \
-f 3,5 \
-m master/SearchEngine.tsv
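The three-column master format (host, path, query name) suggests a lookup like the one in the Python sketch below. It is a guess at the general approach, not the tool's code. The master row for search.example.com is made up for illustration and is not taken from master/SearchEngine.tsv, and load_master and search_word are hypothetical helpers.

```python
from urllib.parse import parse_qs, urlsplit


def load_master(tsv_text):
    """Map (host, path) to the query parameter name that holds the keyword."""
    master = {}
    for row in tsv_text.splitlines():
        if not row.strip():
            continue
        host, path, query_name = row.split("\t")
        master[(host, path)] = query_name
    return master


def search_word(url, master):
    """Return the search keyword of a known search engine URL, else None."""
    parts = urlsplit(url)
    query_name = master.get((parts.hostname, parts.path))
    if query_name is None:
        return None
    values = parse_qs(parts.query).get(query_name)
    return values[0] if values else None


# Illustrative master row, not the real contents of SearchEngine.tsv.
MASTER = load_master("search.example.com\t/search\tq\n")
print(search_word("http://search.example.com/search?q=hadoop+tools&hl=en", MASTER))
```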