
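1 Principles

1.1 Introduction

Before introducing the LR (Logistic Regression) model, one point deserves emphasis: the model was originally designed to solve 0/1 binary classification problems. Although its name contains the word "regression", a regression is only performed implicitly in its linear part; the end goal is still classification.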

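To get a good grasp of the logistic regression model, it helps to first understand linear regression and gradient descent. The following two articles cover them:

Linear Regression

Gradient Descent: A Classic Optimization Method

Recall linear regression first: it fits data with the simplest possible linear equation. That only handles regression tasks, however, and cannot perform classification. Logistic regression builds on linear regression to construct a classification model.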

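Suppose we want to classify on top of the linear model ($z = w^T x + b$), for example in a binary classification task where $y \in \{0, 1\}$. What would we intuitively do? The most direct approach is to wrap the output of the linear model in another function, $\hat{y} = g(z)$. The simplest choice is the "unit-step function", shown as the red line segments in the figure below.

That is, $z = 0$ is treated as the dividing line: samples with $z > 0$ are assigned to class 1, and samples with $z < 0$ are assigned to class 0.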

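However, such a piecewise function has poor mathematical properties: it is discontinuous at the threshold and therefore not differentiable there. In optimization, the objective function should preferably be continuous and differentiable. So how can this be improved?

This is where the logistic function (log-odds function) comes in (its shape is the black curve in the figure):

$$\hat{y} = \frac{1}{1 + e^{-z}}$$

Figure: Unit-step function and logistic function (source: Zhou Zhihua, Machine Learning)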

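It is a "Sigmoid" function. The term Sigmoid function refers to any S-shaped function, and the logistic function is its most important representative. Compared with the piecewise function above, it has very good mathematical properties. Its main advantages are:

When used for classification, it not only predicts the class but also yields an approximate probability. This is useful for many tasks that rely on probabilities to support decisions.

The logistic function is differentiable to any order, so many numerical optimization algorithms can be applied directly to find the optimal solution.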
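The contrast between the two candidate functions can be sketched in a few lines of NumPy. This is only an illustration: the function names are made up here, and the value 0.5 at exactly $z = 0$ for the step function is one common convention, assumed rather than taken from the text.

```python
import numpy as np

def unit_step(z):
    """Unit-step function: 0 for z < 0, 1 for z > 0 (0.5 at z == 0 by assumed convention)."""
    return np.where(z > 0, 1.0, np.where(z < 0, 0.0, 0.5))

def sigmoid(z):
    """Logistic function 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of the logistic function, written in terms of its own output."""
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(unit_step(z))                   # jumps abruptly at z = 0
print(np.round(sigmoid(z), 3))        # changes smoothly between 0 and 1
print(np.round(sigmoid_grad(z), 3))   # defined everywhere, largest at z = 0
```

The step function has zero slope everywhere except at the jump, so it gives gradient-based optimizers nothing to work with, whereas the logistic function has a usable derivative at every point.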

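In summary, the full form of the model is:

$$\hat{y} = \frac{1}{1 + e^{-(w^T x + b)}}$$

In effect, the LR model fits the line $z = w^T x + b$ so that it separates the two classes in the data as correctly as possible.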
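As a small worked example, the forward pass of the model can be written as below. The weights `w`, bias `b`, and sample points are hypothetical values chosen so that the arithmetic is easy to follow by hand; they are not fitted parameters.

```python
import numpy as np

def predict_proba(x, w, b):
    """Linear part z = x.w + b followed by the logistic function."""
    z = x @ w + b
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters, chosen only for illustration.
w = np.array([1.0, -2.0])
b = 0.5

x = np.array([
    [2.0, 1.0],   # z = 2.0 - 2.0 + 0.5 =  0.5
    [0.0, 1.0],   # z = 0.0 - 2.0 + 0.5 = -1.5
    [1.5, 1.0],   # z = 1.5 - 2.0 + 0.5 =  0.0 (exactly on the boundary)
])

proba = predict_proba(x, w, b)
labels = (proba >= 0.5).astype(int)   # threshold the probability at 0.5
print(proba)
print(labels)
```

A point with $z = 0$ lies exactly on the line $w^T x + b = 0$ and gets probability 0.5, which is why that line acts as the decision boundary.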

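1.2 Loss Function

Every machine learning problem needs a clearly defined loss function, and LR is no exception. For regression problems, the usual first choice is the mean squared error (MSE) loss:

$$L = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

But what does the loss function look like for the binary classification problem that LR solves? Here is its form for a single sample; take a moment to look at it before reading the explanation.

$$L = -\left[ y \ln \hat{y} + (1 - y) \ln (1 - \hat{y}) \right]$$

This loss is usually called the log loss (logloss); the logarithm is the natural logarithm (base $e$). The true value $y$ is either 0 or 1, while the predicted value $\hat{y}$, being the output of the logistic function, is a continuous probability between 0 and 1. On closer inspection, the first term is 0 when $y = 0$ and the second term is 0 when $y = 1$, so only one term is ever active for a given sample. The loss can therefore be written as a piecewise function:

$$L = \begin{cases} -\ln \hat{y}, & y = 1 \\ -\ln (1 - \hat{y}), & y = 0 \end{cases}$$

When the true value $y$ is 1, $L$ gets smaller as the output approaches 1; when $y$ is 0, $L$ gets smaller as the output approaches 0 (sketching the curves by hand makes this clear). Merging the two pieces gives exactly the logloss form listed above.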
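The equivalence of the combined and piecewise forms, and the way the loss rewards confident correct predictions, can be checked numerically. The following is a minimal sketch; the function names and the toy probabilities are made up for illustration.

```python
import numpy as np

def log_loss(y_true, y_prob):
    """Combined form: mean of -[y ln(p) + (1 - y) ln(1 - p)]."""
    return np.mean(-(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob)))

def log_loss_piecewise(y_true, y_prob):
    """Piecewise form: -ln(p) when y == 1, -ln(1 - p) when y == 0."""
    per_sample = np.where(y_true == 1, -np.log(y_prob), -np.log(1.0 - y_prob))
    return np.mean(per_sample)

y_true = np.array([1, 0, 1, 0])
confident_right = np.array([0.9, 0.1, 0.8, 0.2])   # probabilities close to the labels
confident_wrong = np.array([0.1, 0.9, 0.2, 0.8])   # probabilities far from the labels

loss_right = log_loss(y_true, confident_right)
loss_wrong = log_loss(y_true, confident_wrong)
print(loss_right, log_loss_piecewise(y_true, confident_right))
print(loss_wrong, log_loss_piecewise(y_true, confident_wrong))
```

Both forms print the same value for the same inputs, and the loss for the predictions that sit far from the labels is much larger than for those close to them.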

1.3 Optimization

With the loss function fixed, the next step is to optimize the model parameters against it repeatedly to obtain the model that best fits the data.

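Looking at the loss function again, it is essentially a function $L$ of the two parameters $w$ and $b$ of the model's linear part. Averaged over the $N$ training samples:

$$L(w, b) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \ln \hat{y}_i + (1 - y_i) \ln (1 - \hat{y}_i) \right]$$

where

$$\hat{y}_i = \frac{1}{1 + e^{-z_i}}, \qquad z_i = w^T x_i + b$$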

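The learning task, written as a mathematical optimization problem, becomes:

$$(w^*, b^*) = \arg\min_{w, b} L(w, b)$$

Since the loss function is continuous and differentiable, gradient descent can be used to solve it. With learning rate $\alpha$, the two core parameters are updated as follows:

$$w \leftarrow w - \alpha \frac{\partial L}{\partial w}, \qquad b \leftarrow b - \alpha \frac{\partial L}{\partial b}$$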

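At this point in the calculation, something interesting happens. For a single sample, the derivative of the loss with respect to $z$ collapses to:

$$\frac{\partial L}{\partial z} = \hat{y} - y$$

After all that calculation, the result turns out to be remarkably simple: just the difference between the predicted value $\hat{y}$ and the true value $y$. This is thanks to the nice mathematical properties of the logistic function itself.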
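For readers who want to check the simplification by hand, the steps can be written out as follows. This is a standard chain-rule derivation for a single sample, using $\hat{y}$ for the output of the logistic function and the fact that its derivative is $\hat{y}(1 - \hat{y})$.

```latex
\begin{aligned}
L &= -\left[ y \ln \hat{y} + (1 - y) \ln (1 - \hat{y}) \right],
\qquad \hat{y} = \frac{1}{1 + e^{-z}} \\[4pt]
\frac{\partial L}{\partial \hat{y}}
  &= -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}
   = \frac{\hat{y} - y}{\hat{y}\,(1 - \hat{y})} \\[4pt]
\frac{d \hat{y}}{d z} &= \hat{y}\,(1 - \hat{y}) \\[4pt]
\frac{\partial L}{\partial z}
  &= \frac{\partial L}{\partial \hat{y}} \cdot \frac{d \hat{y}}{d z}
   = \hat{y} - y
\end{aligned}
```

The factor $\hat{y}(1 - \hat{y})$ produced by the logistic function cancels the denominator produced by the log loss, which is why the two are such a natural pairing.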

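Carrying on with $z = w^T x + b$ and averaging over the $N$ samples, we obtain:

$$\frac{\partial L}{\partial w} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)\, x_i, \qquad \frac{\partial L}{\partial b} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)$$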
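A quick way to gain confidence in gradient formulas like these is to compare them against finite differences. The sketch below does that on random toy data; the function names, the data, and the step size `h` are all illustrative choices, not part of the original implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    """Mean log loss of the model sigmoid(x.w + b)."""
    p = sigmoid(x @ w + b)
    return np.mean(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def analytic_grad(w, b, x, y):
    """Gradients from the closed form: mean of (y_hat - y) * x and mean of (y_hat - y)."""
    p = sigmoid(x @ w + b)
    d_w = (p - y) @ x / len(y)
    d_b = np.mean(p - y)
    return d_w, d_b

def numeric_grad(w, b, x, y, h=1e-6):
    """Central finite differences, one parameter at a time."""
    d_w = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = h
        d_w[j] = (loss(w + step, b, x, y) - loss(w - step, b, x, y)) / (2 * h)
    d_b = (loss(w, b + h, x, y) - loss(w, b - h, x, y)) / (2 * h)
    return d_w, d_b

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2))
y = (x[:, 0] + x[:, 1] > 0).astype(float)
w = rng.normal(size=2)
b = 0.3

a_w, a_b = analytic_grad(w, b, x, y)
n_w, n_b = numeric_grad(w, b, x, y)
print(np.max(np.abs(a_w - n_w)), abs(a_b - n_b))   # both differences should be tiny
```

If the analytic and numeric gradients disagree by more than rounding noise, the derivation or its implementation has a mistake somewhere.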

2 Code Implementation

Next, we implement a simple LR model ourselves in Python.

The complete code is available at: [link]

First, create the file logistic_regression.py and define the LR model class, which implements the core optimization routine internally.

    # -*- coding: utf-8 -*-
    import numpy as np


    class LogisticRegression(object):

        def __init__(self, learning_rate=0.1, max_iter=100, seed=None):
            self.seed = seed
            self.lr = learning_rate
            self.max_iter = max_iter

        def fit(self, x, y):
            # Initialise w and b randomly, then run batch gradient descent.
            np.random.seed(self.seed)
            self.w = np.random.normal(loc=0.0, scale=1.0, size=x.shape[1])
            self.b = np.random.normal(loc=0.0, scale=1.0)
            self.x = x
            self.y = y
            for i in range(self.max_iter):
                self._update_step()
                # print('loss: \t{}'.format(self.loss()))
                # print('score: \t{}'.format(self.score()))
                # print('w: \t{}'.format(self.w))
                # print('b: \t{}'.format(self.b))

        def _sigmoid(self, z):
            return 1.0 / (1.0 + np.exp(-z))

        def _f(self, x, w, b):
            # Linear part followed by the logistic function.
            z = x.dot(w) + b
            return self._sigmoid(z)

        def predict_proba(self, x=None):
            if x is None:
                x = self.x
            y_pred = self._f(x, self.w, self.b)
            return y_pred

        def predict(self, x=None):
            if x is None:
                x = self.x
            y_pred_proba = self._f(x, self.w, self.b)
            # Threshold the probability at 0.5 to obtain a hard 0/1 label.
            y_pred = np.array([0 if p < 0.5 else 1 for p in y_pred_proba])
            return y_pred

        def score(self, y_true=None, y_pred=None):
            # Classification accuracy; defaults to the training data.
            if y_true is None or y_pred is None:
                y_true = self.y
                y_pred = self.predict()
            acc = np.mean(np.asarray(y_true) == np.asarray(y_pred))
            return acc

        def loss(self, y_true=None, y_pred_proba=None):
            # Mean log loss; defaults to the training data.
            if y_true is None or y_pred_proba is None:
                y_true = self.y
                y_pred_proba = self.predict_proba()
            return np.mean(-1.0 * (y_true * np.log(y_pred_proba)
                                   + (1.0 - y_true) * np.log(1.0 - y_pred_proba)))

        def _calc_gradient(self):
            # Use the predicted probabilities, not the thresholded 0/1 labels,
            # so that the gradient matches the derivation: mean((y_hat - y) * x).
            y_pred = self.predict_proba()
            d_w = (y_pred - self.y).dot(self.x) / len(self.y)
            d_b = np.mean(y_pred - self.y)
            return d_w, d_b

        def _update_step(self):
            # One gradient-descent step on w and b.
            d_w, d_b = self._calc_gradient()
            self.w = self.w - self.lr * d_w
            self.b = self.b - self.lr * d_b
            return self.w, self.b
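One practical caveat with `_sigmoid` and `loss` as written: for very large negative `z`, `np.exp(-z)` overflows, and a predicted probability of exactly 0 or 1 sends `np.log` to negative infinity. The sketch below shows one common way to guard against both. The function names and the `eps` value are illustrative choices and are not part of the class above.

```python
import numpy as np

def stable_sigmoid(z):
    """Logistic function for an array z, evaluated without overflowing np.exp."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) <= 1, so the usual formula is safe.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, use the algebraically equal form e^z / (1 + e^z), where e^z < 1.
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

def clipped_log_loss(y_true, y_prob, eps=1e-12):
    """Mean log loss with probabilities clipped away from exactly 0 and 1."""
    p = np.clip(y_prob, eps, 1.0 - eps)
    return np.mean(-(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)))

z = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])
y = np.array([0.0, 0.0, 1.0, 1.0, 1.0])
p = stable_sigmoid(z)
print(p)
print(clipped_log_loss(y, p))   # finite even though p contains exact 0 and 1
```

With the small, normalized inputs used in this article the naive version is unlikely to hit these cases, so this is a robustness measure rather than a required fix.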

Next, we create a separate file, data_helper.py, that generates the simulated data and also implements the train/test split.

    # -*- coding: utf-8 -*-
    import numpy as np


    def generate_data(seed):
        np.random.seed(seed)

        # Class 0: 300 points centred at (5, 4) with standard deviation 1.
        data_size_1 = 300
        x1_1 = np.random.normal(loc=5.0, scale=1.0, size=data_size_1)
        x2_1 = np.random.normal(loc=4.0, scale=1.0, size=data_size_1)
        y_1 = [0 for _ in range(data_size_1)]

        # Class 1: 400 points centred at (10, 8) with standard deviation 2.
        data_size_2 = 400
        x1_2 = np.random.normal(loc=10.0, scale=2.0, size=data_size_2)
        x2_2 = np.random.normal(loc=8.0, scale=2.0, size=data_size_2)
        y_2 = [1 for _ in range(data_size_2)]

        # Stack the two classes into one feature matrix and one label vector.
        x1 = np.concatenate((x1_1, x1_2), axis=0)
        x2 = np.concatenate((x2_1, x2_2), axis=0)
        x = np.hstack((x1.reshape(-1, 1), x2.reshape(-1, 1)))
        y = np.concatenate((y_1, y_2), axis=0)

        # Shuffle so that the two classes are interleaved.
        data_size_all = data_size_1 + data_size_2
        shuffled_index = np.random.permutation(data_size_all)
        x = x[shuffled_index]
        y = y[shuffled_index]
        return x, y


    def train_test_split(x, y):
        # First 70% of the shuffled samples for training, the rest for testing.
        split_index = int(len(y) * 0.7)
        x_train = x[:split_index]
        y_train = y[:split_index]
        x_test = x[split_index:]
        y_test = y[split_index:]
        return x_train, y_train, x_test, y_test

Finally, create train.py, which calls the LR class written above to carry out the classification task and report the accuracy.

    # -*- coding: utf-8 -*-
    import numpy as np
    import matplotlib.pyplot as plt
    import data_helper
    from logistic_regression import LogisticRegression

    # data generation
    x, y = data_helper.generate_data(seed=272)
    x_train, y_train, x_test, y_test = data_helper.train_test_split(x, y)

    # visualize data
    # plt.scatter(x_train[:,0], x_train[:,1], c=y_train, marker='.')
    # plt.show()
    # plt.scatter(x_test[:,0], x_test[:,1], c=y_test, marker='.')
    # plt.show()

    # data normalization (min-max scaling)
    # The test set is scaled with the training set's min/max so that both
    # sets go through the same transformation.
    x_min = np.min(x_train, axis=0)
    x_max = np.max(x_train, axis=0)
    x_train = (x_train - x_min) / (x_max - x_min)
    x_test = (x_test - x_min) / (x_max - x_min)

    # Logistic regression classifier
    clf = LogisticRegression(learning_rate=0.1, max_iter=500, seed=272)
    clf.fit(x_train, y_train)

    # plot the result
    # Decision boundary: w[0] * x1 + w[1] * x2 + b = 0, solved for x2.
    split_boundary_func = lambda x: (-clf.b - clf.w[0] * x) / clf.w[1]
    xx = np.arange(0.1, 0.6, 0.1)
    plt.scatter(x_train[:,0], x_train[:,1], c=y_train, marker='.')
    plt.plot(xx, split_boundary_func(xx), c='red')
    plt.show()

    # loss on test set
    y_test_pred = clf.predict(x_test)
    y_test_pred_proba = clf.predict_proba(x_test)
    print(clf.score(y_test, y_test_pred))
    print(clf.loss(y_test, y_test_pred_proba))
    # print(y_test_pred_proba)

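The resulting plot is shown below:

Figure: classification result plot

The red line is the linear equation inside the LR model. In essence, LR keeps fitting this red decision boundary so that as many samples as possible on each side are classified correctly. LR is therefore a linear classifier; when the input data follow a complex nonlinear distribution, a more complex model or further feature engineering is usually needed.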
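The limitation of a linear boundary, and the effect of feature engineering, can be illustrated with an XOR-style toy dataset. The sketch below is self-contained and does not use the class above: the `fit_lr` and `accuracy` helpers, the four cluster centres, and the hand-crafted product feature `x1 * x2` are all assumptions made for the illustration.

```python
import numpy as np

def fit_lr(x, y, lr=0.5, n_iter=2000):
    """Batch gradient descent for logistic regression, starting from zeros."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        w -= lr * (p - y) @ x / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def accuracy(x, y, w, b):
    # z >= 0 is equivalent to a predicted probability >= 0.5.
    return np.mean(((x @ w + b) >= 0).astype(float) == y)

# Four tight clusters; opposite corners share a label, so no single line separates them.
rng = np.random.default_rng(0)
centers = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
labels = np.array([1.0, 1.0, 0.0, 0.0])
x = np.vstack([c + rng.normal(scale=0.2, size=(50, 2)) for c in centers])
y = np.repeat(labels, 50)

# Raw features: the model is restricted to a straight-line boundary.
acc_raw = accuracy(x, y, *fit_lr(x, y))

# Engineered feature: the product x1 * x2 is positive for class 1 and negative for class 0.
x_aug = np.column_stack([x, x[:, 0] * x[:, 1]])
acc_aug = accuracy(x_aug, y, *fit_lr(x_aug, y))

print(acc_raw, acc_aug)
```

The model itself is unchanged between the two runs; only the input representation differs. With the product feature the classes become linearly separable in the enlarged feature space, which is the sense in which feature engineering can stand in for a more complex model.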


Original title: 對數幾率回歸 (Logistic Regression)

Source: WeChat public account 人工智能愛好者社區 (WeChat ID: AI_shequ). Please credit the source when reposting.
