Posts from 2019/03

programmer

etc 2019. 3. 28. 09:06


:

Secure Shell: How Does SSH Work

Security 2019. 3. 28. 09:03

Taking a remote shell to carry out different tasks is the norm if you have multiple server machines in your infrastructure. Different protocols and tools were created for this purpose. Although the early tools could handle the necessary shell tasks, concerns about their design led to improvements and to new tools.

In this tutorial we will discuss one such tool, which was designed to eliminate the flaws of earlier remote shell programs: the Secure Shell, better known as SSH.

The key characteristics of a good remote login program are listed below.

First and foremost is the privacy of the communication. The connection that provides the remote shell login must be encrypted to prevent eavesdropping.
There must be a mechanism to check that the data sent by either party has not been altered or tampered with. In short, an integrity check is a must.
The server and the client must each prove their identity to the other, to establish proper authentication.

SSH incorporates all of the characteristics above, in addition to some features of its own.

The encryption and authentication mechanisms provided by SSH matter because most communication happens over an unsecured medium (the Internet). SSH was made to replace insecure remote login programs such as rlogin and telnet.

SSH provides multiple mechanisms for authenticating the server and the client. Two commonly used mechanisms are password-based and key-based authentication. Although password-based authentication is also secure, it is advisable to use key-based authentication instead. As mentioned before, SSH provides some features beyond secure authentication and data encryption. Some of the well-known ones are listed below.

SSH Tunneling
TCP port forwarding

We will not discuss the features above in this tutorial. Instead, we will discuss how an SSH connection from a client to a server works, covering the complete workflow in detail. Protocols are commonly revised to improve how they work, and SSH is no exception: the most widely used versions are SSH version 1 and SSH version 2.

The current default and recommended version is SSH protocol version 2. We will begin by discussing the two versions and their workflows separately, and end with a comparison of the two.

When we discuss encryption and data security, there are two primary types of cryptographic systems. One is public key cryptography (sometimes called asymmetric cryptography) and the other is secret key cryptography (sometimes called symmetric cryptography). Most modern security systems use these two types in multiple ways to secure communication. I have two articles that discuss tools related to symmetric and asymmetric encryption.

Those tutorials do not discuss cryptography itself, but they do describe its application in real life. We will include a few articles on cryptography in the coming days (it is quite difficult to discuss the details of cryptography, but we will give it an attempt).

Workflow of Secure Shell(SSH) Protocol Version 1

A misconception widely found among industry people is that "SSH works on public key encryption and not on secret key encryption". I would like to clear this up here. Asymmetric encryption involves a lot of overhead, and decrypting data with it is comparatively slow.

For this reason most protocols employ public key cryptography (asymmetric encryption) only to share the secret key used for symmetric encryption, and symmetric encryption then serves as the primary encryption mechanism for all data communication in that session. To make things clear: asymmetric encryption is only used to share the secret key, which is then used for symmetric encryption.

So the first step in establishing an SSH connection is to create a secure channel between the server and the client.

First the client authenticates the server, because the client is the one that initiates the connection. After the server is authenticated and the client is sure about the identity of the server, a secure symmetric channel is formed between them.

This secure channel is used for authenticating the client and for sharing keys, passwords, and other data. To understand how this works, let's go through the process step by step.

Step 1

A connection is always initiated by the client to the server. So the first step is to establish a TCP connection to port 22 on the server. Let's see what we get when we connect to port 22 on the server.

[root@slashroot1 ~]# telnet 192.168.0.105 22
Trying 192.168.0.105...
Connected to 192.168.0.105 (192.168.0.105).
Escape character is '^]'.
SSH-2.0-OpenSSH_4.3

The client gets two valuable pieces of information from this connection. One is the protocol version supported by the server. The other is the version of the SSH server package (which is not necessary; in fact, you should not reveal it to the client).

At this point the client continues if it supports the version 2 protocol; otherwise it breaks the connection.
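The banner line shown by telnet can be split into its parts with a few lines of Python. This is a minimal sketch that assumes the usual `SSH-<protocol version>-<software version>` layout with an optional comment after a space; the function name `parse_ssh_banner` is made up for illustration.

```python
def parse_ssh_banner(banner: str) -> dict:
    """Split an SSH identification string into protocol and software parts."""
    line = banner.strip()
    if not line.startswith("SSH-"):
        raise ValueError("not an SSH identification string")
    # Layout: SSH-<protocol version>-<software version>[ <comments>]
    _, protocol, rest = line.split("-", 2)
    software, _, comments = rest.partition(" ")
    return {"protocol": protocol, "software": software, "comments": comments}


# The banner from the telnet session above.
info = parse_ssh_banner("SSH-2.0-OpenSSH_4.3\r\n")
print(info)
```

A client could compare `info["protocol"]` against the versions it supports before deciding whether to continue.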

Step 2

Once the client has decided to continue, based on the protocol version shown by the server, the server and the client switch to a "Binary Packet Protocol".

This protocol contains a packet length of 32 bits(This length is excluding the length field, and message authentication field), padding

Step 3:

The server now sends some critical information to the client. This information plays a major role in the current login session.

The information sent by the server is as follows.

1. The server discloses its identity to the client. This identity is the RSA public key of the server. The key is usually stored in the location below (for this example I am working on a CentOS 5 machine). It is created during the OpenSSH server installation.

[root@slashroot1 ssh]# pwd
/etc/ssh
[root@slashroot1 ssh]# cat ssh_host_rsa_key.pub
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAxlwxaB4wKrFHGsqUYEzYtTIJWjgul8ML+bnnJIg0HER1wnW2QitRqDzw6f9cr3WKvwqAQh/Sf4bM0LqUAXZre+J6oiLY7X8V6NtEA8nHO1qryueNe44rI6HYunZ3yo4UAXUZqxhjer+tA8OCD6DLRfXOWIMsBUBXJuB+yl1/qGH2J0Kjrnpj17N0mPMqGmMb8+9EjV1Rs1aSDriIWjDsJIDd8fz4gRoelB5mFsEQ7rD+m/RNWxbAhkBoNcFadRg30LqhCtGYQsWADv0p4THCDVZxB3u9VSWK9qZRgF7LbGRdgiVgJjGDPqCO3cWlnQzxcZ9VdvKy+em1RB9BJ++kuw==

If the client is communicating with the server for the first time, it gets a warning on screen similar to the one below.

[root@slashroot1 ~]# ssh 192.168.0.105
The authenticity of host '192.168.0.105 (192.168.0.105)' can't be established.
RSA key fingerprint is c7:14:f4:85:5f:52:cb:f9:53:56:9d:b3:0c:1e:a3:1f.
Are you sure you want to continue connecting (yes/no)?

The client gets this warning only on the first connection. After that, the host identity key is saved in the known_hosts file, so you will not get a warning on future connections to this server.

The thing to understand here is that this key is a host identity, not a user identity. Any client connecting to that server gets the same host key as the server's identity. If you end up connecting to a different machine instead of this one, you will be warned, because the client does not have that machine's identity in its known_hosts file.
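The colon-separated fingerprint in the warning can be reproduced from a public key line. As far as I know, the legacy format is the MD5 digest of the base64-decoded key blob; the sketch below assumes that, and it uses placeholder bytes rather than a real RSA key, so the printed value is not the fingerprint shown above.

```python
import base64
import hashlib


def md5_fingerprint(public_key_line: str) -> str:
    """Return the colon-separated MD5 fingerprint of an OpenSSH public key line."""
    # A public key line looks like: "<type> <base64 blob> [comment]"
    blob_b64 = public_key_line.split()[1]
    blob = base64.b64decode(blob_b64)
    digest = hashlib.md5(blob).hexdigest()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))


# Hypothetical key line: the blob is placeholder bytes, not a real host key.
example_line = "ssh-rsa " + base64.b64encode(b"placeholder key blob").decode() + " demo@example"
fingerprint = md5_fingerprint(example_line)
print(fingerprint)
```

Feeding the content of ssh_host_rsa_key.pub to `md5_fingerprint` should give a value in the same format as the one the client prints.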

2. The second piece of information the server provides is the server key. This server key is a key sent from the server to the client. This method is not used in SSH version 2, which will be discussed later.

This key is regenerated every hour under the default configuration, and its default size is 768 bits. You can see this in the SSH server configuration file (/etc/ssh/sshd_config).

# Lifetime and size of ephemeral version 1 server key
#KeyRegenerationInterval 1h
#ServerKeyBits 768

3. Eight random bytes, also called check bytes. The client must include these bytes in its next reply to the server.

4. Finally, the server provides the list of all supported encryption and authentication methods.

Step 4:

Based on the list of encryption algorithms supported by the server, the client creates a random symmetric key and sends it to the server.

This symmetric key is used to encrypt and decrypt the whole communication during this session.

The client takes additional care when sending this session key (the symmetric key for this session) by encrypting it twice. The first encryption uses the server host key (which the server shared during step 3), and the second uses the server key (which changes every hour).

This double encryption increases security: even if somebody has the host private key of the server (/etc/ssh/ssh_host_rsa_key), they cannot decrypt the message, because it is still encrypted with the server key, which changes on an hourly basis.

Along with the double-encrypted session key, the client also sends the algorithm it selected from the list of supported algorithms the server gave during step 3.
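The layering can be illustrated with toy textbook RSA. This is only a sketch of the idea: the primes are tiny, there is no padding, and the key values and the session key are made up, so it must not be mistaken for what a real SSH implementation does.

```python
# Toy textbook RSA with tiny primes: for illustration only, never for real use.
def make_keypair(p: int, q: int, e: int):
    n = p * q
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)  # modular inverse (Python 3.8+)
    return (n, e), (n, d)


host_pub, host_priv = make_keypair(61, 53, 17)     # stands in for the host key
server_pub, server_priv = make_keypair(89, 97, 5)  # stands in for the hourly server key

session_key = 1234  # hypothetical session key, smaller than the host modulus

# Client: encrypt with the host key first, then with the server key.
inner = pow(session_key, host_pub[1], host_pub[0])
outer = pow(inner, server_pub[1], server_pub[0])

# Server: undo the two layers in reverse order.
recovered_inner = pow(outer, server_priv[1], server_priv[0])
recovered_key = pow(recovered_inner, host_priv[1], host_priv[0])
print(recovered_key == session_key)
```

An attacker who holds only `host_priv` cannot get past the outer layer, which is the point the paragraph above makes about the hourly server key.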

Step 5:

Notice that the client has still not authenticated the server. It only has the identity of the server, which the server supplied in the form of the server host key.

To authenticate the server, the client needs to be sure that the server was able to decrypt the session key sent during step 4. So after sending the session key (double encrypted with the server key and the server host key), the client waits for a confirmation message from the server.

The confirmation from the server must be encrypted with the symmetric session key that the client sent. Waiting for this confirmation is very important, because the client has no other way to verify that the server host key it received came from the intended server.

Once the client receives the confirmation, both sides can start communicating with this symmetric encryption key.

So far only the server is authenticated. The client is yet to be authenticated by the server.

Client Authentication methods supported by SSH

Now the complete communication will be in a symmetric encrypted form, with the help of the session key.

The client authentication happens over this encrypted channel. There are multiple methods that can be used to authenticate the client. Some of them are mentioned below.

Public Key
Rhosts
Password
Kerberos

Related : Kerberos and its Working

We will only discuss the two most commonly used methods here: password and public key authentication.

Password authentication

Password-based authentication is simple, and it is the most commonly used authentication method in SSH. It works exactly like logging in as a local user with the correct password. On receiving the password, the remote server logs the user in through its native password authentication mechanism.

Note that the password the client sends to the server is encrypted with the symmetric session key (which only the server and the client know).

Public Key Authentication

The second method is public key authentication. It is the strongest method that can be used to authenticate the client in secure shell.

For this authentication to work, the client first needs to create an RSA public and private key pair, which can be done with the ssh-keygen command. Keep in mind that this key pair is only used for authenticating the client, not for encrypting the session. Encryption of the complete SSH session is already established by the symmetric session key that the client and the server shared earlier.

Let's see what happens in public key authentication. We first need to create two keys (one private and one public). The public key is given to all the servers where you require authentication, so a client that needs to log in to multiple servers using a public key must first distribute that key to those servers. Any data encrypted with that public key can only be decrypted with the corresponding private key (which only the original client holds).

[root@slashroot1 ~]# cd .ssh/
[root@slashroot1 .ssh]# ll -a
total 16
drwx------ 2 root root 4096 Feb 25 14:19 .
drwxr-x--- 9 root root 4096 Feb 25 09:58 ..
-rw-r--r-- 1 root root    0 Feb 25 14:08 known_hosts
[root@slashroot1 .ssh]#

The listing above shows the .ssh directory, which holds the user-specific SSH details and sits inside the user's home directory.

As of now the directory has only one file, known_hosts. This file holds the list of server host keys, one per line. We saw earlier that when a user connects to an SSH server for the first time, the user gets a warning about the server host key (because the file does not yet contain an entry for that server's host key, which is what proves the server's identity).

After the first connection the server host key is saved in known_hosts, so you will not get the warning when you connect again (the warning is very helpful, because it lets you know if you are logging in to an attacker's machine instead of the original server).

This directory (~/.ssh) is also where the keys for public key authentication will be saved. Let's create our keys for this authentication.

[root@slashroot1 ~]# ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
47:b0:9a:e5:7f:ca:df:ca:aa:20:4e:68:2d:ed:dc:a5 root@slashroot1.slashroot.in
[root@slashroot1 ~]#

Keep in mind that anything encrypted with the public key (id_rsa.pub) can only be decrypted with the corresponding private key (id_rsa in our case). As you can see, the command above created two files in the .ssh directory. One is the private key (id_rsa), which gets permission 600 by default, and the other is the public key (id_rsa.pub), which will be shared with the server.

Now let's share id_rsa.pub with the server. The server keeps this public key in its list of authorized keys. Sharing the public key does not mean copying the file id_rsa.pub itself; it means that the content of id_rsa.pub must be present in the authorized_keys file on the server. The authorized_keys file is also located in the .ssh directory inside the home directory of the target user.

Let's get back to the point where we stopped at step 5.

From the list of authentication methods supported by the server, the client selects public key authentication and also sends the details of the cryptography used for it.

On receiving the public key authentication request, the server first generates a random 256-bit string as a challenge for the client and encrypts it with the client's public key, which is in the authorized_keys file.

On receiving the challenge, the client decrypts it with the private key (id_rsa). The client does not send the challenge string back as it is. Instead, it combines that string with the session key (which was negotiated earlier and is being used for symmetric encryption) and generates an MD5 hash value. This hash value is sent to the server. On receiving the hash, the server regenerates it (because the server also has both the random string and the session key), and if it matches the hash sent by the client, client authentication succeeds.
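The hash comparison at the end of this exchange can be sketched as follows. The RSA encryption and decryption of the challenge are left out, and combining the two values by simple byte concatenation is an assumption made for illustration; the function name is hypothetical.

```python
import hashlib
import hmac
import secrets


def challenge_response(challenge: bytes, session_key: bytes) -> bytes:
    """Hash the challenge together with the session key (concatenation is an assumption)."""
    return hashlib.md5(challenge + session_key).digest()


session_key = secrets.token_bytes(32)  # already known to both sides
challenge = secrets.token_bytes(32)    # 256-bit random string chosen by the server

# RSA step omitted: assume the client has decrypted the challenge with id_rsa.
client_answer = challenge_response(challenge, session_key)

# The server recomputes the hash and compares the two values in constant time.
server_expected = challenge_response(challenge, session_key)
authenticated = hmac.compare_digest(client_answer, server_expected)
print(authenticated)
```

A client that could not decrypt the challenge would have to guess it, and its hash would not match the server's.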

Now let's get inside SSH protocol version 2.

SSH Protocol Version 2

SSH protocol version 2 is the default protocol used these days, owing to some major advancements over version 1. The workflow of an SSH login is almost the same as in version 1, but there are major changes at the protocol level. These include improved encryption standards, public key certification, much better message authentication codes, and rekeying of the session key.

If you are using CentOS/RHEL 5 like me, you are probably using OpenSSH version 4.3 or higher. These SSH clients select protocol version 2 by default and fall back to version 1 if the server does not support version 2.

In version 1, functions such as key exchange, authentication, and encryption were all part of a single protocol, which is why it is sometimes called monolithic. SSH version 2 implements these functions in separate protocols, which together make up protocol version 2. Let's look at these internal protocols.

Transport Layer Protocol
Connection protocol
Authentication Protocol

The three protocols above are defined in separate RFCs.

SSH version 1 is very limited in the range of algorithms it supports for session key exchange, message authentication codes, compression, and so on. SSH 2 gives the client a good number of choices to select from, and even leaves room for adding your own custom algorithm.

When we discussed the session key (the symmetric key used for encrypting the complete session), we saw that the client, after selecting one algorithm from the list supported by the server, generates a symmetric key and sends it to the server with double encryption (first with the server host key and then with the server key, which changes every hour). In SSH version 2 there is no concept of a server key. Instead, the server provides the list of supported key exchange methods, from which the client selects one. As of now SSH version 2 works on diffie-hellman-group1-sha1 for exchanging keys. I recommend reading the Wikipedia article below to get an idea of it.

Read: diffie-hellman key exchange method

The basic idea behind this change is that no single party (client or server) should decide the session key. All key exchange algorithms added in the future will have this property (neither the server nor the client can dictate the session key).
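A toy Diffie-Hellman exchange shows why neither side dictates the result. The numbers below are deliberately tiny and chosen only for illustration; real SSH groups use primes of at least 1024 bits, and the secrets would be random.

```python
# Public parameters agreed by both sides (toy sizes, illustration only).
p, g = 23, 5

client_secret = 6   # chosen privately by the client
server_secret = 15  # chosen privately by the server

client_public = pow(g, client_secret, p)  # sent to the server
server_public = pow(g, server_secret, p)  # sent to the client

# Each side combines its own secret with the other side's public value.
client_shared = pow(server_public, client_secret, p)
server_shared = pow(client_public, server_secret, p)
print(client_shared, server_shared)
```

Both sides end up with the same value, yet it depends on both secrets, so neither the client nor the server could have picked it alone.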

So there is no concept of server key (which is altered every hour) in ssh version 2.

The second major change in SSH version 2 is the introduction of a certificate authority (CA), which signs the public keys used in the communication. This is exactly the same method used in SSL. However, it has not been implemented in the real world as of now, although the protocol already makes room for it.

Message authentication codes are used in secure communication to verify the integrity of a message. SSH version 1 also performs an integrity check, using CRC-32 (Cyclic Redundancy Check). Although CRC does detect alteration of data, it is not considered adequate when security is a major concern. SSH 2 uses stronger, keyed hash-based MACs (HMAC). Some of the message authentication codes supported by SSH 2 are listed below.

hmac-md5
hmac-sha1
hmac-ripemd160
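The difference between a keyed MAC and a plain checksum can be seen with Python's standard `hmac` and `zlib` modules. The key and messages below are made up for illustration, and this is a sketch of the idea rather than of SSH's packet format.

```python
import hashlib
import hmac
import zlib

key = b"hypothetical-integrity-key"
message = b"ls -l /var/log"

# Keyed tag: computing it requires the shared key.
tag = hmac.new(key, message, hashlib.sha1).hexdigest()


def verify(key: bytes, message: bytes, tag: str) -> bool:
    expected = hmac.new(key, message, hashlib.sha1).hexdigest()
    return hmac.compare_digest(expected, tag)


tampered = b"ls -l /etc/ssh"
print(verify(key, message, tag))   # genuine message
print(verify(key, tampered, tag))  # altered message

# CRC-32 needs no key, so whoever alters the data can also recompute the checksum.
forged_crc = zlib.crc32(tampered)
```

Without the key an attacker cannot produce a valid HMAC tag for the altered message, whereas a matching CRC-32 is trivial to compute.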

Rekeying of session key

SSH includes an added functionality called rekeying of the session key. In SSH version 1, only one session key was used for the entire session. These days almost all activities are carried out over SSH sessions. I myself log in to a remote server and leave that session open for weeks at a time, so that I can continue from where I left off.

In such long-lived SSH sessions it is better to change the session key without breaking the session. That is a feature of SSH version 2.

Let's now summarize the differences between SSH version 1 and SSH version 2.

Diffie-Hellman key exchange is used instead of the server key for sharing the session key in the version 2 protocol
No Rhosts support in SSH 2
SSH protocol version 1 only allows negotiation of the symmetric encryption algorithm; everything else (MAC, compression, etc.) is hard-coded
SSH 2 supports certificates for the public keys used
An SSH 2 server can require the client to pass multiple authentication methods in a single session. SSH version 1 only supports one method per session
SSH version 2 allows the session key to be changed periodically.

Hope this article was helpful in understanding the workflow of SSH and the differences between the two versions of SSH.

Source - https://www.slashroot.in/secure-shell-how-does-ssh-work


:

Click Through Rate Prediction

Machine Learning 2019. 3. 21. 15:29

Kaggle
- Avazu - Predict whether a mobile ad will be clicked
  - Data
  - Beat the benchmark with less than 1MB of memory
- CriteoLabs - Display Advertising Challenge
  - Data
  - For md5sum verification
  - Beat the benchmark with less than 200MB of memory

Kaggle

Avazu - Predict whether a mobile ad will be clicked

https://www.kaggle.com/c/avazu-ctr-prediction

Data

Download the test.gz and train.gz files provided at https://www.kaggle.com/c/avazu-ctr-prediction/data

The train.gz file is 1.12GB.

Data Field

id: ad identifier
click: 0/1 for non-click/click
hour: format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC.
C1 -- anonymized categorical variable
banner_pos
site_id
site_domain
site_category
app_id
app_domain
app_category
device_id
device_ip
device_model
device_type
device_conn_type
C14-C21 -- anonymized categorical variables
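The `hour` field can be turned into a timestamp with the standard library. A minimal sketch, assuming the field is read as a string; the helper name `parse_hour` is made up.

```python
from datetime import datetime, timezone


def parse_hour(value: str) -> datetime:
    """Parse the YYMMDDHH `hour` field into a UTC datetime."""
    return datetime.strptime(value, "%y%m%d%H").replace(tzinfo=timezone.utc)


# The example from the field description: 23:00 on Sept. 11, 2014 UTC.
parsed = parse_hour("14091123")
print(parsed.isoformat())
```

From the parsed value it is easy to derive features such as hour of day or day of week.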

4 Idiots' Solution & LIBFFM

https://www.kaggle.com/c/avazu-ctr-prediction/discussion/12608

Beat the benchmark with less than 1MB of memory

https://www.kaggle.com/c/avazu-ctr-prediction/discussion/10927

Paper and implementation

https://www.csie.ntu.edu.tw/~cjlin/libffm/

https://github.com/guestwalk/kaggle-avazu

CriteoLabs - Display Advertising Challenge

https://www.kaggle.com/c/criteo-display-ad-challenge

Data

The data download link on Kaggle's data page is broken. The link for downloading from the CriteoLabs website is as follows.

http://labs.criteo.com/2014/02/download-kaggle-display-advertising-challenge-dataset/
- https://s3-eu-west-1.amazonaws.com/criteo-labs/dac.tar.gz (the actual download link that the labs.criteo.com page above points to)

The dac.tar.gz file provided at the link above is about 4GB. The download is slow and can take more than 10 hours.

https://jkkim.me/kaggle/dac.tar.gz - a copy of the downloaded file

For md5sum verification

$ md5 *.gz
MD5 (dac.tar.gz) = df9b1b3766d9ff91d5ca3eb3d23bed27
MD5 (sampleSubmission.gz) = 39c3ff7b677a8de71412f7cb00c4e5f2
MD5 (test.gz) = 47e20bc113bd2009b46dd125bb987c76
MD5 (train.gz) = f65aa86a4d3e3219c17225bd301b64f6
$
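The same check can be scripted so that a multi-gigabyte download is hashed in chunks. This is a sketch using `hashlib`; the `EXPECTED` table holds only the dac.tar.gz value from the listing above, and the helper names are made up.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file without loading it all into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Checksum copied from the listing above.
EXPECTED = {"dac.tar.gz": "df9b1b3766d9ff91d5ca3eb3d23bed27"}


def matches(path: str, name: str) -> bool:
    """Return True when the file at `path` has the checksum recorded for `name`."""
    return md5_of_file(path) == EXPECTED[name]
```

Calling `matches("dac.tar.gz", "dac.tar.gz")` in the download directory would confirm that the archive arrived intact.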

Beat the benchmark with less than 200MB of memory

https://www.kaggle.com/c/criteo-display-ad-challenge/discussion/10322

https://github.com/guestwalk/kaggle-2014-criteo

https://www.csie.ntu.edu.tw/~r01922136/kaggle-2014-criteo.pdf

https://medium.com/@chris_bour/what-i-learned-from-the-kaggle-criteo-data-science-odyssey-b7d1ba980e6


:

Understanding Boosting Techniques

Machine Learning 2019. 3. 6. 15:18

boosting기법이해.zip

boosting기법이해.z01

Attribution, Non-commercial


:

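Regression Analysis Lecture Notes

Machine Learning 2019. 3. 6. 14:57

Prof. Kwon Se-hyuk - Regression Analysis Lecture Notes

Prof. Kwon Se-hyuk, Department of Statistics, Hannam University

http://wolfpack.hnu.ac.kr/lecture/Regression/

Chapter 1 Introduction

Chapter 2 Simple Regression, Estimation and Testing

Chapter 3 Residual Analysis

Chapter 4 Multiple Regression

Chapter 5 Indicator Variable Models

Chapter 6 Multicollinearity

Chapter 7 Variable Selection

Chapter 8 Diagnosing Influential Points and Outliers

Chapter 9 Logistic Regression

Chapter 10 Econometrics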

ch6_multicollinearity.pdf


:

Maximum Likelihood Method (Maximum Likelihood)

Machine Learning 2019. 3. 6. 14:56

Definition

A method for estimating the parameter of a random variable from values sampled from it.

In other words, it selects the parameter that maximizes the likelihood of obtaining the observed values.

Method

Suppose there is a collection of random variables $D_{θ} = (X_{1}, X_{2}, \dots, X_{n})$ determined by a parameter $θ$, the probability density function or probability mass function of $D_{θ}$ is $f$, and the values $x_{1}, x_{2}, \dots, x_{n}$ were obtained from those random variables. The likelihood $L (θ)$ is then as follows.

$L (θ) = f_{θ} (x_{1}, x_{2}, \dots, x_{n})$

The $θ$ that maximizes the likelihood is:

$\hat{θ} = \arg\max_{θ} L (θ)$

If $X_{1}, X_{2}, \dots, X_{n}$ are all independent and identically distributed, $L$ can be written as follows.

$L (θ) = \prod_{i = 1}^{n} f_{θ} (x_{i})$

Also, because the logarithm is monotonically increasing, the log of $L$ attains its maximum at the same $\hat{θ}$ as $L$ itself, and the computation becomes comparatively simple.

$L^{⋆} (θ) = \log (L (θ)) = \sum_{i = 1}^{n} \log f_{θ} (x_{i})$

$L (θ) = \prod_{i = 1}^{n} f_{θ} (x_{i})$

$= f_{θ} (x_{1}) \cdot f_{θ} (x_{2}) \cdot f_{θ} (x_{3}) \cdot \dots$

$\log (L (θ)) = \log (\prod_{i = 1}^{n} f_{θ} (x_{i}))$

$= \log f_{θ} (x_{1}) + \log f_{θ} (x_{2}) + \log f_{θ} (x_{3}) + \dots$

Example (estimating a population proportion)

Suppose we draw one person as a sample from the entire population of South Korea and want to know whether that person is male or female. The sample random variable then follows a Bernoulli distribution, which is defined as follows.

For a random variable $X$ whose value is 0 or 1 depending on which of two outcomes a single trial produces,

$f (x) = p^{x} {(1 - p)}^{(1 - x)}$

Then the likelihood when a total of $n$ people are sampled is given by:

$L (X_{1} = x_{1}, X_{2} = x_{2}, \dots, X_{n} = x_{n} | p) = \prod_{i = 1}^{n} p^{x_{i}} {(1 - p)}^{1 - x_{i}} \quad \dots (a)$

This expression can be explained as follows. Suppose we sample 10 people and the sexes of persons 1 through 10 are {male, female, male, male, female, female, male, male, female, male}.

If we set $X_{i} = 0$ for a male and $X_{i} = 1$ for a female, then the current $x_{i} (i = 1, 2, \dots, 10)$ are

{0, 1, 0, 0, 1, 1, 0, 0, 1, 0}. Expression $(a)$ then becomes:

$L (X_{1} = 0, X_{2} = 1, \dots, X_{10} = 0 | p) = \prod_{i = 1}^{10} p^{x_{i}} {(1 - p)}^{1 - x_{i}}$

However, $p$ (the probability that a sample is female) is still unknown, so we apply the maximum likelihood method to find the parameter $p$ that maximizes the likelihood $L$.

Differentiating $L$ in expression $(a)$ with respect to $p$ directly is not easy. Using the fact that the log function is monotonically increasing, we introduce the auxiliary function $L^{⋆} = \log (L)$. $L^{⋆}$ is then:

$L^{⋆} = \log (L) = \sum_{i = 1}^{n} \log (p^{x_{i}} {(1 - p)}^{(1 - x_{i})}) = \sum_{i = 1}^{n} [x_{i} \log (p) + (1 - x_{i}) \log (1 - p)]$

Finding the $p$ at which the partial derivative of $L^{⋆}$ with respect to $p$ is 0 gives the estimate of the parameter $p$ that maximizes the likelihood.

$\frac{\partial L^{⋆}}{\partial p} = \frac{\sum_{i = 1}^{n} x_{i}}{p} - \frac{\sum_{i = 1}^{n} (1 - x_{i})}{1 - p}$

$= \frac{n \bar{X}}{p} - \frac{n (1 - \bar{X})}{1 - p} = 0$

$\Rightarrow \frac{n \bar{X}}{p} = \frac{n (1 - \bar{X})}{1 - p}$

$\Rightarrow (1 - p) \bar{X} = (1 - \bar{X}) p$

$\Rightarrow \bar{X} - p \bar{X} = p - \bar{X} p$

$\Rightarrow p = \bar{X}$

Therefore, it is appropriate to estimate the parameter $p$ by $p = \bar{X}$.
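The result can be checked numerically on the ten-person sample from the example. The sketch below evaluates the Bernoulli log-likelihood on a grid of candidate values for $p$ and compares the best one with the sample mean; the grid resolution of 0.01 is an arbitrary choice.

```python
import math

sample = [0, 1, 0, 0, 1, 1, 0, 0, 1, 0]  # 0 = male, 1 = female


def log_likelihood(p: float, xs: list) -> float:
    """Bernoulli log-likelihood: sum of x*log(p) + (1-x)*log(1-p)."""
    return sum(x * math.log(p) + (1 - x) * math.log(1 - p) for x in xs)


grid = [i / 100 for i in range(1, 100)]  # candidate values 0.01 .. 0.99
best_p = max(grid, key=lambda p: log_likelihood(p, sample))
sample_mean = sum(sample) / len(sample)
print(best_p, sample_mean)
```

The grid maximum and the sample mean coincide at 0.4, the share of women in the sample.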

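This is natural when you think about it: when estimating a population proportion, all we can go on is the sex ratio of the people currently gathered, and it is natural to presume that the current state arose because the population proportion was like that.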


:

Parameter Estimation for the Logistic Regression Model

Machine Learning 2019. 3. 6. 14:56

Sigmoid function (logistic function)

$\mathrm{sigmoid} (x) = \frac{1}{1 + e^{- x}}$

Derivative of the sigmoid function

$\frac{d}{d x} \mathrm{sigmoid} (x) = \frac{d}{d x} {(1 + e^{- x})}^{- 1}$

$= (- 1) \frac{1}{{(1 + e^{- x})}^{2}} \frac{d}{d x} (1 + e^{- x})$

$= (- 1) \frac{1}{{(1 + e^{- x})}^{2}} (0 + e^{- x}) \frac{d}{d x} (- x)$

$= (- 1) \frac{1}{{(1 + e^{- x})}^{2}} e^{- x} (- 1)$

$= \frac{e^{- x}}{{(1 + e^{- x})}^{2}}$

$= \frac{(1 + e^{- x})}{{(1 + e^{- x})}^{2}} - \frac{1}{{(1 + e^{- x})}^{2}}$

$= \frac{1}{1 + e^{- x}} - \frac{1}{{(1 + e^{- x})}^{2}}$

$= \frac{1}{1 + e^{- x}} (1 - \frac{1}{1 + e^{- x}})$

$= \mathrm{sigmoid} (x) (1 - \mathrm{sigmoid} (x))$

$\therefore σ' (x) = σ (x) (1 - σ (x))$
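The closed form can be sanity-checked against a numerical derivative. This is a small sketch using a central difference; the step size `h` and the test points are arbitrary choices.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_prime(x: float) -> float:
    """Closed form from the derivation: sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def numeric_derivative(f, x: float, h: float = 1e-5) -> float:
    """Central difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)


for x in (-2.0, 0.0, 3.0):
    print(x, sigmoid_prime(x), numeric_derivative(sigmoid, x))
```

The two columns should agree to many decimal places, and at $x = 0$ the derivative is exactly 0.25.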

Cost 함수

$C o s t (h_{θ} (x), y) = - y l o g (h_{θ} (x)) - (1 - y) l o g (1 - h_{θ} (x))$

전체 Cost 함수

$j (θ) = - \frac{1}{m} \sum_{i = 1}^{m} [y^{(i)} \log (h_{θ} (x^{(i)})) + (1 - y^{(i)}) \log (1 - h_{θ} (x^{(i)}))]$

Cost 함수 미분

$\frac{\partial}{\partial θ_{j}} j (θ) = \frac{\partial}{\partial θ_{j}} \frac{- 1}{m} \sum_{i = 1}^{m} [y^{(i)} \log (h_{θ} (x^{(i)})) + (1 - y^{(i)}) \log (1 - h_{θ} (x^{(i)}))]$

$= - \frac{1}{m} \sum_{i = 1}^{m} [y^{(i)} \frac{\partial}{\partial θ_{j}} \log (h_{θ} (x^{(i)})) + (1 - y^{(i)}) \frac{\partial}{\partial θ_{j}} \log (1 - h_{θ} (x^{(i)}))]$

$= - \frac{1}{m} \sum_{i = 1}^{m} [\frac{y^{(i)} \frac{\partial}{\partial θ_{j}} h_{θ} (x^{(i)})}{h_{θ} (x^{(i)})} + \frac{(1 - y^{(i)}) \frac{\partial}{\partial θ_{j}} (1 - h_{θ} (x^{(i)}))}{1 - h_{θ} (x^{(i)})}]$

$= - \frac{1}{m} \sum_{i = 1}^{m} [\frac{y^{(i)} \frac{\partial}{\partial θ_{j}} σ (θ^{T} x^{(i)})}{h_{θ} (x^{(i)})} + \frac{(1 - y^{(i)}) \frac{\partial}{\partial θ_{j}} (1 - σ (θ^{T} x^{(i)}))}{1 - h_{θ} (x^{(i)})}]$

$= - \frac{1}{m} \sum_{i = 1}^{m} [\frac{y^{(i)} σ (θ^{T} x^{(i)}) (1 - σ (θ^{T} x^{(i)})) \frac{\partial}{\partial θ_{j}} θ^{T} x^{(i)}}{h_{θ} (x^{(i)})} + \frac{- (1 - y^{(i)}) σ (θ^{T} x^{(i)}) (1 - σ (θ^{T} x^{(i)})) \frac{\partial}{\partial θ_{j}} θ^{T} x^{(i)}}{1 - h_{θ} (x^{(i)})}]$

$= - \frac{1}{m} \sum_{i = 1}^{m} [\frac{y^{(i)} h_{θ} (x^{(i)}) (1 - h_{θ} (x^{(i)})) \frac{\partial}{\partial θ_{j}} θ^{T} x^{(i)}}{h_{θ} (x^{(i)})} + \frac{- (1 - y^{(i)}) h_{θ} (x^{(i)}) (1 - h_{θ} (x^{(i)})) \frac{\partial}{\partial θ_{j}} θ^{T} x^{(i)}}{1 - h_{θ} (x^{(i)})}]$

$= - \frac{1}{m} \sum_{i = 1}^{m} [y^{(i)} h_{θ} (x^{(i)}) (1 - h_{θ} (x^{(i)})) x_{j}^{(i)} + - (1 - y^{(i)}) h_{θ} (x^{(i)}) x_{j}^{(i)}]$

$= - \frac{1}{m} \sum_{i = 1}^{m} [y^{(i)} h_{θ} (x^{(i)}) (1 - h_{θ} (x^{(i)})) + - (1 - y^{(i)}) h_{θ} (x^{(i)})] x_{j}^{(i)}$

$= - \frac{1}{m} \sum_{i = 1}^{m} [y^{(i)} - y^{(i)} h_{θ} (x^{(i)}) - h_{θ} (x^{(i)}) + y^{(i)} h_{θ} (x^{(i)})] x_{j}^{(i)}$

$= - \frac{1}{m} \sum_{i = 1}^{m} [y^{(i)} - h_{θ} (x^{(i)})] x_{j}^{(i)}$

$= \frac{1}{m} \sum_{i = 1}^{m} [h_{θ} (x^{(i)}) - y^{(i)}] x_{j}^{(i)}$

Gradient Desent

$R e p e a t {θ_{j} := θ_{j} - α \frac{\partial}{\partial θ_{j}} J (θ)}$

↓

$\text{Repeat} \; \{ θ_{j} := θ_{j} - \frac{α}{m} \sum_{i = 1}^{m} (h_{θ} (x^{(i)}) - y^{(i)}) x_{j}^{(i)} \}$
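A minimal sketch of this update rule in plain Python follows. The learning rate, iteration count, toy data, and function names are all assumed values chosen for illustration. Every $θ_{j}$ is updated simultaneously from the same set of errors.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = sigmoid(theta^T x)."""
    return sigmoid(sum(t * v for t, v in zip(theta, x)))

def cost(theta, X, y):
    """Cross-entropy cost J(theta)."""
    m = len(X)
    return -sum(
        yi * math.log(hypothesis(theta, xi))
        + (1 - yi) * math.log(1 - hypothesis(theta, xi))
        for xi, yi in zip(X, y)
    ) / m

def gradient_descent(X, y, alpha=0.1, iterations=2000):
    """Repeat theta_j := theta_j - (alpha/m) * sum_i (h(x_i) - y_i) * x_ij."""
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(iterations):
        # Compute all errors first so every theta_j uses the same old theta.
        errors = [hypothesis(theta, xi) - yi for xi, yi in zip(X, y)]
        theta = [
            theta[j] - alpha / m * sum(e * xi[j] for e, xi in zip(errors, X))
            for j in range(n)
        ]
    return theta

# Hypothetical toy data (not linearly separable); first column is x_0 = 1.
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 3.0], [1.0, 3.5], [1.0, 4.0]]
y = [0, 0, 1, 0, 1, 1]

theta = gradient_descent(X, y)
print(theta, cost(theta, X, y))
```

The toy labels overlap on purpose: with perfectly separable data the cost has no finite minimizer and the parameters would keep growing.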

Attribution, NonCommercial


:

Logistic Function

Machine Learning 2019. 3. 6. 14:55

Logistic Function

$y = \frac{1}{1 + e^{- z}}$

In linear regression, the function built for the model has the following form.

$y = W x + b$

$y = W_{1} \cdot x_{1} + W_{2} \cdot x_{2} + \dots + W_{i} \cdot x_{i} + b$

$y = W \cdot X + b$

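The purpose of this linear function is to observe how the dependent variable $y$ changes as the independent variable $x$ changes, and both $x$ and $y$ range from negative infinity $- \infty$ to positive infinity $\infty$.

Linear regression can be used to examine or predict the relationship between age and blood pressure. Age and blood pressure are both continuous variables, so representing them with a straight-line graph poses no problem.

However, when the data records whether a disease such as cancer has occurred, the dependent variable $y$ becomes a categorical variable with values such as diseased = 1 and normal = 0.

To express disease status with a linear equation, suppose we want a value of 1 for a person who smokes 5 cigarettes a day. Setting the slope $W$ to 1 for convenience and initializing $b$ to -4 gives $y = 1 * 5 - 4 = 1$.

But if the number of cigarettes rises to 10, then $y = 1 * 10 - 4 = 6$, which falls outside the range of diseased = 1 and normal = 0.

In short, the independent variable $x$ ranges from $- \infty$ to $\infty$, while the dependent variable $y$ takes only the categories 1 and 0, so the ordinary linear equation cannot represent it.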
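The arithmetic above can be reproduced in a few lines. The sketch below uses the same $W = 1$ and $b = -4$, and additionally passes the linear output through the logistic function $y = \frac{1}{1 + e^{- z}}$ to show that it stays between 0 and 1. The function names are assumptions made for illustration.

```python
import math

def linear_model(x, w=1.0, b=-4.0):
    """Plain linear output y = w * x + b (W = 1, b = -4 as in the text)."""
    return w * x + b

def logistic(z):
    """Logistic function, which maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

for cigarettes in (5, 10):
    z = linear_model(cigarettes)
    # The linear output can exceed 1; the logistic output cannot.
    print(cigarettes, z, logistic(z))
```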

Extending the Range of the Dependent Variable

To extend the range of the dependent variable $y$ to $- \infty$ through $\infty$, we use the odds $\mathrm{odds} = \frac{p}{1 - p}$ and the logit function $z = \mathrm{logit} (p) = \log (\frac{p}{1 - p})$.

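Odds

The odds are the ratio of the probability of success to the probability of failure. If the probability of success is $p$, the probability of failure is $1 - p$.

Viewed this way, the odds can be written as $\frac{p}{1 - p}$.

Since $p$ lies between 0 and 1, the odds equal $\frac{0}{1 - 0} = 0$ at the smallest value $p = 0$, and they grow without bound as $p$ approaches its largest value 1: $\lim_{p \to 1^{-}} \frac{p}{1 - p} = \infty$.

In other words, the odds are the ratio of the probability that event $A$ occurs to the probability that it does not occur, which can be written as $\mathrm{odds} = \frac{P (A)}{P (A^{c})} = \frac{P (A)}{1 - P (A)}$.

The larger the odds, the higher the probability that event $A$ occurs.

Applying the odds in this way gives a new function of $p$ whose range runs from 0 to $\infty$.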

Logit function

The odds extend the range to 0 through $\infty$. To extend it further to $- \infty$ through $\infty$, take the natural logarithm of the odds.

$\mathrm{logit} (p) = \ln (\frac{p}{1 - p})$
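To make the two ranges concrete, the sketch below evaluates the odds and the logit at a few probabilities. The sample values of `p` and the function names are assumptions chosen for illustration.

```python
import math

def odds(p):
    """Odds p / (1 - p): maps a probability in [0, 1) to [0, infinity)."""
    return p / (1.0 - p)

def logit(p):
    """Log-odds ln(p / (1 - p)): maps (0, 1) to (-infinity, infinity)."""
    return math.log(odds(p))

for p in (0.01, 0.5, 0.99):
    print(p, odds(p), logit(p))
```

Probabilities below 0.5 give odds below 1 and a negative logit, probabilities above 0.5 give odds above 1 and a positive logit, and $p = 0.5$ sits at odds 1 and logit 0.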

Deriving the Logistic Regression Model

Start by writing the probability that the dependent variable $Y$ equals 1 as a linear function.
$P (Y = 1 | X = \vec{x}) = β_{0} + β_{1} x_{1} + β_{2} x_{2} + \dots + β_{n} x_{n} = {\vec{β}}^{T} \vec{x}$
The left side is confined to the interval from 0 to 1 while the right side is unbounded, so replace the left side with the odds.
$\frac{P (Y = 1 | X = \vec{x})}{1 - P (Y = 1 | X = \vec{x})} = {\vec{β}}^{T} \vec{x}$
Take the natural logarithm of the left side (the odds).
$\ln (\frac{P (Y = 1 | X = \vec{x})}{1 - P (Y = 1 | X = \vec{x})}) = {\vec{β}}^{T} \vec{x}$
Let $p (x)$ be the probability of category 1 given $x$, substitute $a$ for the right side of the equation above, and solve for the probability $p$.
$\frac{p (x)}{1 - p (x)} = e^{a}$
$p (x) = e^{a} \{1 - p (x)\} = e^{a} - e^{a} p (x)$
$p (x) (1 + e^{a}) = e^{a}$
$p (x) = \frac{e^{a}}{1 + e^{a}} = \frac{e^{a}}{1 + e^{a}} \cdot \frac{\frac{1}{e^{a}}}{\frac{1}{e^{a}}} = \frac{1}{\frac{1}{e^{a}} + 1} = \frac{1}{1 + e^{- a}}$
$P (Y = 1 | X = \vec{x}) = \frac{1}{1 + e^{- ({\vec{β}}^{T} \vec{x})}}$
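The result can be checked numerically: applying the logit to the model's output should give back the linear score $a = {\vec{β}}^{T} \vec{x}$, and both forms of $p (x)$ in the derivation should agree. The coefficients and input below are hypothetical values, and the function names are assumptions for this sketch.

```python
import math

def linear_score(beta, x):
    """a = beta^T x, with x[0] = 1 acting as the input for the intercept."""
    return sum(b * v for b, v in zip(beta, x))

def predict_proba(beta, x):
    """P(Y = 1 | X = x) = 1 / (1 + e^{-a})."""
    return 1.0 / (1.0 + math.exp(-linear_score(beta, x)))

def logit(p):
    return math.log(p / (1.0 - p))

beta = [-4.0, 1.0]  # hypothetical coefficients (intercept, slope)
x = [1.0, 5.0]      # hypothetical input: x_0 = 1, x_1 = 5

a = linear_score(beta, x)
p = predict_proba(beta, x)
alt = math.exp(a) / (1.0 + math.exp(a))  # the e^a / (1 + e^a) form

print(a, p, alt, logit(p))
```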

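Decision Boundary of Binary Logistic Regression

Given an input vector $x$ whose category is unknown, the binary logistic model returns the probability that it belongs to category 1.

The condition for classifying the input into category 1 can be expressed as follows.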

$P (Y = 1 | X = \vec{x}) > P (Y = 0 | X = \vec{x})$

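Because there are only two categories, substituting $p (x)$ for the left side lets us rearrange the condition as follows. The last step takes the natural logarithm of both sides and uses $\ln \frac{p (x)}{1 - p (x)} = β^{T} x$.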

$p (x) > 1 - p (x)$

$\frac{p (x)}{1 - p (x)} > 1$

$β^{T} x > 0$

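Likewise, if $β^{T} x < 0$, the data point is classified into category 0. The decision boundary of the logistic model is therefore the hyperplane $β^{T} x = 0$.

When the input vector is two-dimensional, this boundary can be visualized as a straight line in the plane.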
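The equivalence between the probability rule $p (x) > 1 - p (x)$ and the sign rule $β^{T} x > 0$ can be illustrated with a short sketch. The coefficients below are hypothetical and place the boundary on the line $x_{1} + x_{2} = 3$; the sample points and function names are also assumptions.

```python
import math

def score(beta, x):
    """Linear score beta^T x, with x[0] = 1 as the intercept input."""
    return sum(b * v for b, v in zip(beta, x))

def predict_proba(beta, x):
    """P(Y = 1 | X = x) from the logistic model."""
    return 1.0 / (1.0 + math.exp(-score(beta, x)))

def classify(beta, x):
    """Category 1 when beta^T x > 0, otherwise category 0."""
    return 1 if score(beta, x) > 0 else 0

beta = [-3.0, 1.0, 1.0]  # hypothetical: boundary is x1 + x2 = 3
points = [[1.0, 1.0, 1.0], [1.0, 2.0, 2.0], [1.0, 0.5, 4.0]]

for x in points:
    p = predict_proba(beta, x)
    # The sign rule and the p > 0.5 rule give the same label.
    print(x[1:], round(p, 3), classify(beta, x), int(p > 0.5))
```

A point lying exactly on the boundary has $p (x) = 0.5$; this sketch assigns it to category 0, which is an arbitrary tie-breaking choice.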

Attribution, NonCommercial


:

8 posts in '2019/03'
