How large does the training set for the kNN algorithm need to be?
Answers: 2 Bounty: 50
Resolved: 2021-02-07 23:16
- Asked by user: 佞臣
- 2021-02-07 04:33
How large does the training set for the kNN algorithm need to be?
Best answer
- Five-star expert user: 摆渡翁
- 2021-02-07 04:57
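It depends. The reason for splitting data into a training set and a test set is the risk of overfitting: the test set checks that the model you built does not fit only this one set of data. I usually use 70% of the data for training and 30% for testing, although that depends on how much data there is and how complex it is. As a rule of thumb, keeping the training set at least as large as the test set is a safe starting point, but whether a particular split works well has to be judged case by case. With fewer than 1,000 samples, k-fold cross-validation is the better choice (I recommend it for basically any data that is not especially complex). With more than 100,000 samples, consider 80:20 or even 90:10.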
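The rules of thumb above can be sketched in a few lines. This is only an illustration in plain Python, not part of the original answer: the function names (`suggest_split`, `k_fold_indices`, `knn_predict`, `cross_val_accuracy`) are hypothetical, the thresholds are the ones quoted in the answer, and the toy data is made up.

```python
import random
from collections import Counter


def suggest_split(n_samples):
    """Map a dataset size to the rules of thumb quoted in the answer."""
    if n_samples < 1000:
        return "k-fold cross-validation"
    if n_samples > 100_000:
        return "80:20 or 90:10 hold-out"
    return "70:30 hold-out"


def k_fold_indices(n_samples, n_folds, seed=0):
    """Shuffle the indices once and deal them into n_folds near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points closest to query."""
    by_distance = sorted(
        range(len(train_x)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(train_x[j], query)),
    )
    votes = Counter(train_y[j] for j in by_distance[:k])
    return votes.most_common(1)[0][0]


def cross_val_accuracy(xs, ys, k=3, n_folds=5, seed=0):
    """Mean accuracy over the folds; each fold is the test set exactly once."""
    scores = []
    for fold in k_fold_indices(len(xs), n_folds, seed):
        held_out = set(fold)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        hits = sum(knn_predict(train_x, train_y, xs[i], k) == ys[i] for i in fold)
        scores.append(hits / len(fold))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy data (made up): two well-separated groups of 10 points each.
    xs = [(i * 0.1, 0.0) for i in range(10)] + [(10 + i * 0.1, 0.0) for i in range(10)]
    ys = [0] * 10 + [1] * 10
    print(suggest_split(len(xs)))
    print(cross_val_accuracy(xs, ys, k=3, n_folds=5))
```

With k-fold cross-validation every sample is used for testing once and for training in the other folds, which is why it suits small datasets where a single 70:30 split would leave too few test samples to trust.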
All answers
- Reply 1, user: 患得患失的劫
- 2021-02-07 06:11
function [ccr,pgroupt]=knnt(x,group,k,dist,xt,groupt)
%#
%# AIM:         To classify test set objects or unknown objects with the
%#              k nearest neighbour method.
%#
%# PRINCIPLE:   kNN is a supervised, deterministic, non-parametric
%#              classification method. It uses the majority rule to
%#              assign new objects to a class.
%#              It is assumed that the number of objects in each class
%#              is similar.
%#              There are no assumptions about the data distribution and
%#              the variance-covariance matrices of each class.
%#              There is no limitation on the number of variables when
%#              the Euclidean distance is used.
%#              However, when the correlation coefficient is used, the
%#              number of variables must be larger than 1.
%#              Ref: Massart D. L., Vandeginste B. G. M., Deming S. N.,
%#              Michotte Y. and Kaufman L., Chemometrics: a textbook,
%#              Chapter 23, 395-397, Elsevier Science Publishers B. V.,
%#              Amsterdam 1988.
%#
%# INPUT:       x:      (m x n) data matrix with m objects and n variables,
%#                      containing samples of several classes (training set)
%#              group:  (m x 1) column vector labelling the m objects from
%#                      the training set
%#              k:      integer, number of nearest neighbours
%#              dist:   integer,
%#                      = 1, Euclidean distance
%#                      = 2, correlation coefficient (no. of variables > 1)
%#              xt:     (mt x n) data matrix with mt objects and n variables
%#                      (test set or unknowns)
%#              groupt: (mt x 1) column vector labelling the mt objects from
%#                      the test set
%#                      --> if the new objects are unknown, input [].
%#
%# OUTPUT:      ccr:     scalar, correct classification rate
%#              pgroupt: row vector, predicted class label for the test set;
%#                       0 means that the object is not classified to any
%#                       class
%#
%# SUBROUTINES: sortlab.m: sorts the group label vector into classes
%#
%# AUTHOR:      Wen Wu
%#              Copyright(c) 1997 for ChemoAC
%#              FABI, Vrije Universiteit Brussel
%#              Laarbeeklaan 103 1090 Jette
%#
%# VERSION:     1.1 (28/02/1998)
%#
%# TEST:        Andrea Candolfi
%#
if nargin==5, groupt=[]; end            % test labels omitted: unknown objects
distance=dist; clear dist               % rename the distance option
group=group(:)';                        % force a row vector
groupt=groupt(:)';                      % force a row vector (stays empty for [])

[m,n]=size(x);                          % size of the training set
if distance==2 && n<2                   % correlation needs at least 2 variables
    error('number of variables must be > 1');
end
mt=size(xt,1);                          % number of test set objects
dis=zeros(mt,m);                        % distance matrix (test x training)

% distance between each test set object and each training set object
for i=1:mt
    for j=1:m
        if distance==1
            % squared Euclidean distance (gives the same neighbour ranking)
            dis(i,j)=(xt(i,:)-x(j,:))*(xt(i,:)-x(j,:))';
        else
            r=corrcoef(xt(i,:)',x(j,:)');   % correlation coefficient matrix
            r=r(1,2);                       % correlation coefficient
            dis(i,j)=1-r*r;                 % 1 - squared correlation coefficient
        end
    end
end

% find the nearest neighbours
lab=zeros(1,mt);                        % predicted labels, 0 = not classified
for i=1:mt                              % for each test object
    [a,b]=sort(dis(i,:));               % sort distances
    b=b(a<=a(k));                       % k nearest neighbours (ties with the k-th are kept)
    b=group(b);                         % class labels of the nearest neighbours
    [ng,lgroup]=sortlab(b);             % number of neighbours per class
    a=find(ng==max(ng));                % class(es) with the most neighbours
    if length(a)==1                     % a single winning class
        lab(i)=lgroup(a);               % class label
    else
        lab(i)=0;                       % tie between classes: not classified
    end
end

% success rate
ccr=[];                                 % stays empty when the true labels are unknown
if ~isempty(groupt)
    ccr=sum(groupt==lab)/mt;            % fraction of correctly predicted labels
end
pgroupt=lab;                            % the output vector
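The routine above calls `sortlab.m`, which the post does not include, so it cannot be run as posted. The sketch below illustrates the same steps in plain Python, chosen only so that it runs without MATLAB or the missing helper. It is an assumption-laden illustration, not the ChemoAC implementation: all function names are hypothetical, label counting is done with `collections.Counter` in place of `sortlab`, and the correlation distance assumes that no object has constant values across its variables.

```python
from collections import Counter
from math import sqrt


def squared_euclidean(u, v):
    """Squared Euclidean distance; it ranks neighbours like the plain distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def one_minus_r_squared(u, v):
    """1 - r^2, with r the Pearson correlation of u and v (needs > 1 variable)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    r = cov / sqrt(var_u * var_v)  # assumes neither vector is constant
    return 1.0 - r * r


def knn_classify(x, group, k, xt, groupt=None, metric=squared_euclidean):
    """Return (ccr, predicted); ccr is None when the true labels are unknown."""
    predicted = []
    for query in xt:
        dists = [metric(query, row) for row in x]
        cutoff = sorted(dists)[k - 1]
        # Keep every training object tied with the k-th nearest one.
        neighbours = [group[j] for j, d in enumerate(dists) if d <= cutoff]
        ranked = Counter(neighbours).most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            predicted.append(0)  # tie between classes: not classified
        else:
            predicted.append(ranked[0][0])
    ccr = None
    if groupt is not None:
        ccr = sum(p == t for p, t in zip(predicted, groupt)) / len(xt)
    return ccr, predicted


if __name__ == "__main__":
    # Toy data (made up): class 1 near the origin, class 2 near (5, 5).
    x = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
    group = [1, 1, 2, 2]
    xt = [(0.0, 0.5), (5.0, 5.5)]
    print(knn_classify(x, group, 1, xt, groupt=[1, 2]))
```

Because label 0 is reserved for "not classified", the class labels themselves should not include 0, in this sketch as in the original routine.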