How large does the training set for the kNN algorithm need to be?

Answers: 2 Bounty: 50

Resolved: 2021-02-07 23:16

Asked by user: 佞臣
2021-02-07 04:33

How large does the training set for the kNN algorithm need to be?

Best answer

Five-star expert user: 摆渡翁
2021-02-07 04:57

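There is no fixed answer. The reason for splitting the data into a training set and a test set is the risk of overfitting: you need a held-out test set to check that the model you built does not fit only this one set of data. I usually use 70% for training and 30% for testing. Of course, it depends on how much data you have and how complex it is. As long as the training set is at least as large as the test set, the split is not unreasonable, but whether it is a good one has to be judged case by case. With fewer than 1,000 samples, k-fold cross-validation is the better choice (in fact I recommend k-fold cross-validation for basically any data that is not especially complex). With more than 100,000 samples, consider 80:20 or even 90:10.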
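The rules of thumb in this answer can be sketched in Python. This is only a minimal illustration using the standard library: the function names `suggest_split`, `holdout_indices` and `kfold_indices` are made up for this example, and the default of 5 folds is an assumption, since the answer does not say which k to use.

```python
import random


def suggest_split(n_samples):
    """Map a dataset size to the evaluation scheme suggested in the answer.

    The thresholds (1,000 and 100,000 samples) are the answer's heuristics,
    not hard rules.
    """
    if n_samples < 1000:
        return "k-fold"
    if n_samples > 100_000:
        return "holdout 80:20 (or 90:10)"
    return "holdout 70:30"


def holdout_indices(n_samples, train_fraction=0.7, seed=0):
    """Shuffle the row indices and cut them into a training and a test part."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    cut = int(round(n_samples * train_fraction))
    return idx[:cut], idx[cut:]


def kfold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train, test) index lists; each row lands in exactly one test fold."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


if __name__ == "__main__":
    print(suggest_split(500))
    train, test = holdout_indices(10)
    print(len(train), len(test))
    for train, test in kfold_indices(10, n_folds=5):
        print(sorted(test))
```

With k-fold cross-validation every sample is used for testing exactly once, which is why it tends to suit small datasets, where a single 30% hold-out would leave few samples on either side of the split.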

All answers

Reply 1, user: 患得患失的劫
2021-02-07 06:11

%# function [ccr,pgroupt]=knnt(x,group,k,dist,xt,groupt)
%#
%# AIM:         to classify test set objects or unknown objects with the
%#              k nearest neighbour method
%#
%# PRINCIPLE:   kNN is a supervised, deterministic, non-parametric
%#              classification method. It uses the majority rule to
%#              assign new objects to a class.
%#              It is assumed that the number of objects in each class
%#              is similar.
%#              There are no assumptions about the data distribution and
%#              the variance-covariance matrices of each class.
%#              There is no limitation of the number of variables when
%#              the Euclidean distance is used.
%#              However, when the correlation coefficient is used, the
%#              number of variables must be larger than 1.
%#              Ref: Massart D. L., Vandeginste B. G. M., Deming S. N.,
%#              Michotte Y. and Kaufman L., Chemometrics: A Textbook,
%#              Chapter 23, 395-397, Elsevier Science Publishers B. V.,
%#              Amsterdam 1988.
%#
%# INPUT:       x:      (mxn) data matrix with m objects and n variables,
%#                      containing samples of several classes (training set)
%#              group:  (mx1) column vector labelling the m objects from the
%#                      training set
%#              k:      integer, number of nearest neighbours
%#              dist:   integer,
%#                      = 1, Euclidean distance
%#                      = 2, correlation coefficient, (no. of variables >1)
%#              xt:     (mtxn) data matrix with mt objects and n variables
%#                      (test set or unknowns)
%#              groupt: (mtx1) column vector labelling the mt objects from
%#                      the test set
%#                      --> if the new objects are unknown, input [].
%#
%# OUTPUT:      ccr:     scalar, correct classification rate
%#              pgroupt: row vector, predicted class label for the test set
%#                       0 means that the object is not classified to any
%#                       class
%#
%# SUBROUTINES: sortlab.m: sorts the group label vector into classes
%#
%# AUTHOR:      Wen Wu
%#              Copyright(c) 1997 for ChemoAC
%#              FABI, Vrije Universiteit Brussel
%#              Laarbeeklaan 103 1090 Jette
%#
%# VERSION:     1.1 (28/02/1998)
%#
%# TEST:        Andrea Candolfi
%#

function [ccr,pgroupt]=knnt(x,group,k,dist,xt,groupt)

if nargin==5, groupt=[]; end    % for unknown objects
distance=dist; clear dist       % change variable
ccr=[];                         % stays empty when no test labels are given

if size(group,1)>1
    group=group';               % change column vector into row vector
    groupt=groupt';             % change column vector into row vector
end

[m,n]=size(x);                  % size of the training set
if distance==2 && n<2           % check the number of variables when using the correlation coefficient
    error('number of variables must be > 1')
end
[mt,n]=size(xt);                % size of the test set
dis=zeros(mt,m);                % initial values for the distance (matrix of zeros)

% calculation of the distance between each test set object
% and each training set object
for i=1:mt
    for j=1:m
        if distance==1
            % squared Euclidean distance (ranks neighbours the same way
            % as the Euclidean distance)
            dis(i,j)=(xt(i,:)-x(j,:))*(xt(i,:)-x(j,:))';
        else
            r=corrcoef(xt(i,:)',x(j,:)');   % correlation coefficient matrix
            r=r(1,2);                       % correlation coefficient
            dis(i,j)=1-r*r;                 % 1 - squared correlation coefficient
        end
    end
end

% finding of the nearest neighbours
lab=zeros(1,mt);                % initial values of lab
for i=1:mt                      % for each test object
    [a,b]=sort(dis(i,:));       % sort distances
    b=b(find(a<=a(k)));         % indices of the nearest neighbours (ties at the k-th distance are kept)
    b=group(b);                 % class labels of the nearest neighbours
    [ng,lgroup]=sortlab(b);     % number of objects from each class among the nearest neighbours
    a=find(ng==max(ng));        % find the class with the maximum number of objects
    if length(a)==1             % only one class
        lab(i)=lgroup(a);       % class label
    else
        lab(i)=0;               % more than one class: not classified
    end
end

% calculation of the success rate
if ~isempty(groupt)
    dif=groupt-lab;             % difference between predicted class label and known class label
    ccr=sum(dif==0)/mt;         % success rate
end
pgroupt=lab;                    % the output vector
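For readers without MATLAB, the voting logic of the routine above can be approximated in Python. This is a rough sketch rather than a port of the full function: it covers only the Euclidean option, the names `knn_predict` and `correct_classification_rate` are invented for this example, and class labels are assumed to be nonzero because 0 is reserved for objects that are not classified.

```python
from collections import Counter


def knn_predict(train_x, train_y, test_x, k):
    """Predict a label for each row of test_x; 0 means 'not classified' (tied vote)."""
    predictions = []
    for query in test_x:
        # Squared Euclidean distance from the query to every training object.
        dists = [sum((a - b) ** 2 for a, b in zip(query, row)) for row in train_x]
        cutoff = sorted(dists)[k - 1]
        # Keep every object at least as close as the k-th one, ties included.
        votes = Counter(y for d, y in zip(dists, train_y) if d <= cutoff)
        ranked = votes.most_common()
        tied = len(ranked) > 1 and ranked[0][1] == ranked[1][1]
        predictions.append(0 if tied else ranked[0][0])
    return predictions


def correct_classification_rate(predicted, actual):
    """Fraction of test objects whose predicted label equals the known label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


if __name__ == "__main__":
    train_x = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
    train_y = [1, 1, 1, 2, 2, 2]
    test_x = [[0.2, 0.1], [5.5, 5.5]]
    predicted = knn_predict(train_x, train_y, test_x, k=3)
    print(predicted)                                       # [1, 2]
    print(correct_classification_rate(predicted, [1, 2]))  # 1.0
    # Two equally distant neighbours from different classes: the vote is tied.
    print(knn_predict([[0], [2]], [1, 2], [[1]], k=1))     # [0]
```

As in the MATLAB code, a tie between classes yields label 0 instead of an arbitrary pick, so a small or unbalanced training set shows up as unclassified objects rather than as silent guesses.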
