prototype package

Submodules

prototype.github_repo module

class prototype.github_repo.GithubRepo(strUser, strName)[source]

Bases: object

classmethod fromURL(strURL)[source]

alternative constructor which takes a URL instead of user and name

Parameters:strURL – URL of the github-repository
Returns:a GithubRepo instance created via the main constructor
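
A minimal usage sketch, assuming the prototype package is importable; the repository shown is only an example:

    from prototype.github_repo import GithubRepo

    # both calls are assumed to yield an equivalent object
    repoA = GithubRepo('octocat', 'Hello-World')
    repoB = GithubRepo.fromURL('https://github.com/octocat/Hello-World')
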
getDevTime()[source]

Gets the development time of the repository in days. This is calculated via the difference between the ‘created_at’ and ‘updated_at’ timestamps

Returns:integer giving the development time in days
getDicFoundWords()[source]

gets the stored dictionary-object

Returns:dictionaryObject

getFeatureOccurences(lstFeatureNames, lstOccurrence, iMinOccurence=1)[source]

gets the found words with their number of occurrences in the form of a dictionary

Parameters:
  • lstFeatureNames – vocab list
  • lstOccurrence – list of number of occurrences
  • iMinOccurence – minimum number of hits which are needed
Returns:

dictionaryObject
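
A hedged usage sketch (repo denotes a GithubRepo instance); the vocabulary, the counts, and the resulting dictionary below are illustrative assumptions, with lstFeatureNames and lstOccurrence aligned by index:

    # hypothetical vocabulary and per-word occurrence counts
    lstFeatureNames = ['arduino', 'lecture', 'dataset']
    lstOccurrence = [4, 0, 2]

    dicFoundWords = repo.getFeatureOccurences(lstFeatureNames, lstOccurrence, iMinOccurence=1)
    # with iMinOccurence=1 the result is expected to resemble {'arduino': 4, 'dataset': 2}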

getFilteredReadme(bApplyStemmer=True, bCheckStopWords=False)[source]

Returns the filtered readme with prepare_words() applied

Returns:string of the filtered readme
getFilteredRepoDescription(bApplyStemmer=True, bCheckStopWords=False)[source]

gets a filtered version of the description of the repository. If the description wasn’t set, an empty string “” will be returned

Parameters:
  • bApplyStemmer – true if the words should be reduced to their stems
  • bCheckStopWords – true if known stopwords (the, he, and, ...) should be ignored
Returns:

string which contains the filtered form of the description
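
A short sketch of the two filter flags (repo denotes a GithubRepo instance; the exact output depends on prepare_words()):

    # stems only, with known stopwords removed
    strReadme = repo.getFilteredReadme(bApplyStemmer=True, bCheckStopWords=True)
    # raw word forms, stopwords kept
    strDescription = repo.getFilteredRepoDescription(bApplyStemmer=False, bCheckStopWords=False)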

getIntegerFeatures()[source]

gets the intFeatures as a list

Returns:list of the integer features
getName()[source]

getter method for name

Returns:self.name
getNormedFeatures(lstMeanValues)[source]

returns the features, normed by dividing them by the mean values

Parameters:lstMeanValues – mean value of every integer feature
Returns:list of the normed integer features
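
The norming is assumed to be a plain element-wise division by the mean values; the numbers below are hypothetical:

    lstIntFeatures = repo.getIntegerFeatures()          # e.g. [10, 250, 3]
    lstMeanValues = [5.0, 100.0, 6.0]                   # hypothetical mean values, one per integer feature
    lstNormed = repo.getNormedFeatures(lstMeanValues)   # roughly [2.0, 2.5, 0.5] under this assumption
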
getNumOpenIssue()[source]

gets the number of open issues from the json-main-page

Returns:integer number of open issues
getNumWatchers()[source]

gets the number of watchers from the json-main-page

Returns:number of watchers

getReadme()[source]

Gets the raw content of the readme of the repository, which can either be a README.md or README.rst file. The job of loading and exporting the readme is done by its Io-Agent.

Returns:string with the raw content
getRepoDescription()[source]

Gets the full description of the repository which is stored in the json-Api. If the description wasn’t set, an empty string “” will be returned

Returns:string which contains the description
getRepoLanguage()[source]

Gets the language from the main json-Api-page which was assigned by github to this repository. If no language was allocated, “undetected” will be returned

Returns:string which contains the language (e.g. C++, Java, Python,...)
getRepoLanguageAsVector()[source]

Returns an integer-list with 102 entries. All of them are set to 0 except the entry for the language which is used

Returns:list
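
A sketch of the assumed one-hot layout (repo denotes a GithubRepo instance); the index-to-language mapping is internal to the class:

    lstLangVector = repo.getRepoLanguageAsVector()
    assert len(lstLangVector) == 102
    # all entries are 0 except the one for the detected language
    iLangIndex = lstLangVector.index(max(lstLangVector))
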
getUser()[source]

getter method for user

Returns:self.user
getWordOccurences(lstVocab)[source]

calculates the number of occurrences of the words given by the vocab list; afterwards this list is divided by the word-length of the readme and multiplied by a factor

Parameters:lstVocab – vocabulary which is used in the CountVectorizer of scikit-learn
Returns:integer list representing the percentage-usage of the vocabulary words
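
A hedged sketch of the assumed computation: counts per vocabulary word, scaled by the readme word-length and a constant factor:

    lstVocab = ['driver', 'tutorial', 'dataset']    # vocabulary as passed to scikit-learn's CountVectorizer
    lstWordShares = repo.getWordOccurences(lstVocab)
    # one value per vocab word, proportional to count / readme word count
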
printFeatureOccurences(dicFoundWords)[source]

Prints out every feature with its number of occurrences

Parameters:
  • lstFeatureNames – list of the given features; these are the columns of the sparse-matrix
  • lstOccurrence – number of occurrences of the individual features (has the same size as lstFeatureNames)
  • iMinOccurence – minimum threshold to print out the feature (if set to 0 all features are printed out)
Returns:

readAttributes()[source]

reads all attributes of the json-file and fills the integer-attributes

Returns:
prototype.github_repo.auto_str(cls)[source]

Method for auto-generating a to-string function which prints out all member-attributes

Parameters:cls – current class
Returns:cls
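
A minimal sketch of applying the decorator; the exact string format produced by auto_str is an assumption:

    from prototype.github_repo import auto_str

    @auto_str
    class Dummy:
        def __init__(self):
            self.iValue = 42
            self.strName = 'demo'

    print(Dummy())  # assumed to print all member-attributes, e.g. iValue=42, strName=demo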

prototype.interface_repository_classifier module

class prototype.interface_repository_classifier.Interface_RepoClassifier[source]

Bases: object

exportModelToFile()[source]

abstract method. Exports the model and all prerequisites to the directory model/

Returns:
loadModelFromFile()[source]

abstract method. Loads the exported model

Returns:
loadTrainingData(strProjPathFileNameCSV)[source]

abstract method. The classifier loads the sample data from a given csv-file

Parameters:strProjPathFileNameCSV – path to the csv-file
Returns:
plotTheResult()[source]

abstract method. Creates a plot in which the classification is illustrated

Returns:
predictCategoryFromOwnerRepoName(strUser, strRepoName)[source]

abstract method. Predicts the category for a repository

Parameters:
  • strUser
  • strRepoName
Returns:

predictCategoryFromURL(strGitHubRepoURL)[source]

abstract method. Predicts the category for a repository given by URL

Parameters:strGitHubRepoURL
Returns:
predictResultsAndCompare(strProjPathFileNameCSV)[source]

abstract method. Predicts the categories for the repositories in a given csv-file and compares the result with the manual classification

Parameters:strProjPathFileNameCSV – path to the csv-file
Returns:
trainModel(lstTrainData, lstTrainLabels)[source]

abstract method. The model shall be trained via supervised learning

Parameters:
  • lstTrainData – matrix of the training data
  • lstTrainLabels – list of the associated labels
Returns:
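
A minimal skeleton of a concrete implementation; the method bodies are placeholders, not the behaviour of the shipped RepositoryClassifier:

    from prototype.interface_repository_classifier import Interface_RepoClassifier

    class DummyClassifier(Interface_RepoClassifier):
        def loadTrainingData(self, strProjPathFileNameCSV):
            pass  # read the sample data from the csv-file

        def trainModel(self, lstTrainData, lstTrainLabels):
            pass  # fit a supervised model on the training data

        def predictCategoryFromURL(self, strGitHubRepoURL):
            pass  # return the predicted category label

        # exportModelToFile, loadModelFromFile, plotTheResult,
        # predictCategoryFromOwnerRepoName and predictResultsAndCompare omitted for brevity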

prototype.repository_classifier module

class prototype.repository_classifier.RepositoryClassifier(bUseStringFeatures=True)[source]

Bases: prototype.interface_repository_classifier.Interface_RepoClassifier

exportModelToFile()[source]

exports the trained model and the mean values of the input variables to ‘./model/’. The export is done via joblib.dump() to a .pkl-file

Returns:
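
The export named above follows the standard joblib pattern; the sketch below assumes clf is the trained model and lstMeanValues the list of mean values, and uses the file names documented for loadModelFromFile:

    import joblib

    joblib.dump(clf, './model/RepositoryClassifier.pkl')
    joblib.dump(lstMeanValues, './model/lstMeanValues.pkl')
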
getLabelAlternative(lstFinalPercentages)[source]

gets the first alternative (the second result)

Parameters:lstFinalPercentages – percentage list for the single categories
Returns:integer label which describes the category
loadModelFromFile()[source]

loads / imports the model-object from ‘./model/RepositoryClassifier.pkl’ and the list of the mean values from ‘./model/lstMeanValues.pkl’

Returns:
loadTrainingData(strProjPathFileNameCSV='/data/csv/additional_data_sets_cleaned.csv', externalpath=None)[source]

trains the model with a given csv-file. The csv file must have 2 columns, URL and CATEGORY. The URL is given in the form ‘https://github.com/owner/repository-name’ and the CATEGORY is given by one of these options: ‘DEV’, ‘HW’, ‘EDU’, ‘DOCS’, ‘WEB’, ‘DATA’, ‘OTHER’

Parameters:strProjPathFileNameCSV – file path relative to the project-path where the csv-file is stored
Returns:self.lstTrainData (the scaled and normed data with which the model was trained), self.lstTrainLabels (the used training labels)
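
A hedged sketch of the call (repoClassifier denotes a RepositoryClassifier instance) and of the expected csv layout; the rows shown are purely illustrative:

    # URL,CATEGORY
    # https://github.com/owner/some-firmware,HW
    # https://github.com/owner/lecture-notes,EDU

    lstTrainData, lstTrainLabels = repoClassifier.loadTrainingData('/data/csv/additional_data_sets_cleaned.csv')
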
plotTheResult(lstTrainData, lstTrainLabels)[source]

this is currently empty -> see the plots in the GUI instead

Parameters:
  • lstTrainData – matrix which was used for training
  • lstTrainLabels – labels which were used for training
Returns:

predictCategoryFromGitHubRepoObj(tmpRepo)[source]

predicts the category for a GithubRepo-Object

Parameters:tmpRepo – GithubRepo-Object
Returns:iLabel, iLabelAlt, lstFinalPercentages, tmpRepo, lstNormedInputFeatures

predictCategoryFromOwnerRepoName(strUser, strRepoName)[source]

predicts the category for a repository which is given by the user and repo-name

Parameters:
  • strUser – owner of the repository
  • strRepoName – name of the repository
Returns:

predictCategoryFromURL(strGitHubRepoURL)[source]

loads the features of a given repository by URL and the model predicts its category-label

Parameters:strGitHubRepoURL – url to the repository
Returns:label value from 0 - 6, list of the percentages for the other categories
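
A minimal end-to-end sketch, assuming a previously exported model exists under ‘./model/’ and that the two documented return values are unpacked as shown:

    from prototype.repository_classifier import RepositoryClassifier

    repoClassifier = RepositoryClassifier(bUseStringFeatures=True)
    repoClassifier.loadModelFromFile()
    iLabel, lstFinalPercentages = repoClassifier.predictCategoryFromURL('https://github.com/octocat/Hello-World')
    # iLabel is an integer from 0 - 6, lstFinalPercentages holds the per-category percentages
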
predictProbaNearestCentroids(matCentroids, lstInputFeatures)[source]

because predictProba was missing in the default functionality for nearest-centroid, the probability is calculated via the distances to the different centroids

Parameters:
  • matCentroids – matrix of the centroids for each category
  • lstInputFeatures – full normed input feature list on which the prediction is based
Returns:
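
One plausible way to turn centroid distances into pseudo-probabilities, shown as a standalone numpy sketch; the actual weighting used by predictProbaNearestCentroids may differ:

    import numpy as np

    def distances_to_probabilities(matCentroids, lstInputFeatures):
        # Euclidean distance from the input point to every centroid
        dists = np.linalg.norm(np.asarray(matCentroids) - np.asarray(lstInputFeatures), axis=1)
        # closer centroids get larger weights; normalise so the weights sum to 1
        weights = 1.0 / (dists + 1e-12)
        return weights / weights.sum()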

predictResultsAndCompare(strProjPathFileNameCSV='/data/csv/manual_classification_appendix_b.csv')[source]

loads a csv-file with the layout ‘URL, CATEGORY, CATEGORY_ALTERNATIVE_1, CATEGORY_ALTERNATIVE_2’. The URL is given in the format ‘https://github.com/owner/repository-name’; the CATEGORY, CATEGORY_ALTERNATIVE_1 and CATEGORY_ALTERNATIVE_2 are given by one of these options: ‘DEV’, ‘HW’, ‘EDU’, ‘DOCS’, ‘WEB’, ‘DATA’, ‘OTHER’

After the prediction phase the result is compared with the given CATEGORY and CATEGORY_ALTERNATIVES. A verification matrix is created and the accuracy is calculated from 0.0 to 1.0

Parameters:strProjPathFileNameCSV – path relative to the project-path where the csv file is stored
Returns:the accuracy value (0.0 - 1.0)
trainModel(lstTrainData, lstTrainLabels)[source]

trains the model called self.clf with the given trainData and trainLabels

Parameters:
  • lstTrainData – list
  • lstTrainLabels
Returns:

Module contents