prototype package

Submodules

prototype.github_repo module

class prototype.github_repo.GithubRepo(strUser, strName)[source]

Bases: object

classmethod fromURL(strURL)[source]

alternative constructor which takes a URL instead of user and name

Parameters:strURL – URL of the github-repository
Returns:a GithubRepo instance created via the main constructor
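
A minimal usage sketch, assuming the prototype package is importable; the repository shown is only an example:

    from prototype.github_repo import GithubRepo

    # both calls are assumed to yield an equivalent object
    repoA = GithubRepo('octocat', 'Hello-World')
    repoB = GithubRepo.fromURL('https://github.com/octocat/Hello-World')
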
getDevTime()[source]

Gets the development time of the repository in days. This is calculated via the difference between the ‘created_at’ and ‘updated_at’ timestamps

Returns:integer giving the development time in days
getDicFoundWords()[source]

gets the stored dictionary-object

Returns:dictionaryObject

getFeatureOccurences(lstFeatureNames, lstOccurrence, iMinOccurence=1)[source]

gets the found words with their number of occurrences in the form of a dictionary

Parameters:
  • lstFeatureNames – vocab list
  • lstOccurrence – list of number of occurrences
  • iMinOccurence – minimum number of hits which are needed
Returns:

dictionaryObject
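
A hedged usage sketch (repo denotes a GithubRepo instance); the vocabulary, the counts, and the resulting dictionary below are illustrative assumptions, with lstFeatureNames and lstOccurrence aligned by index:

    # hypothetical vocabulary and per-word occurrence counts
    lstFeatureNames = ['arduino', 'lecture', 'dataset']
    lstOccurrence = [4, 0, 2]

    dicFoundWords = repo.getFeatureOccurences(lstFeatureNames, lstOccurrence, iMinOccurence=1)
    # with iMinOccurence=1 the result is expected to resemble {'arduino': 4, 'dataset': 2}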

getFilteredReadme(bApplyStemmer=True, bCheckStopWords=False)[source]

Returns the filtered readme with prepare_words() applied

Returns:string of the filtered readme
getFilteredRepoDescription(bApplyStemmer=True, bCheckStopWords=False)[source]

gets a filtered version of the description of the repository. If the description wasn’t set, an empty string “” will be returned

Parameters:
  • bApplyStemmer – true if the words should be reduced to their stems
  • bCheckStopWords – true if known stopwords (the, he, and, ...) should be ignored
Returns:

string which contains the filtered form of the description
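
A short sketch of the two filter flags (repo denotes a GithubRepo instance; the exact output depends on prepare_words()):

    # stems only, with known stopwords removed
    strReadme = repo.getFilteredReadme(bApplyStemmer=True, bCheckStopWords=True)
    # raw word forms, stopwords kept
    strDescription = repo.getFilteredRepoDescription(bApplyStemmer=False, bCheckStopWords=False)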

getIntegerFeatures()[source]

gets the intFeatures as a list

Returns:list of the integer features
getName()[source]

getter method for name

Returns:self.name
getNormedFeatures(lstMeanValues)[source]

returns the features, normed by dividing them by the mean values

Parameters:lstMeanValues – mean value of every integer feature
Returns:list of the normed integer features
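
The norming is assumed to be a plain element-wise division by the mean values; the numbers below are hypothetical:

    lstIntFeatures = repo.getIntegerFeatures()          # e.g. [10, 250, 3]
    lstMeanValues = [5.0, 100.0, 6.0]                   # hypothetical mean values, one per integer feature
    lstNormed = repo.getNormedFeatures(lstMeanValues)   # roughly [2.0, 2.5, 0.5] under this assumption
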
getNumOpenIssue()[source]

gets the number of open issues from the json-main-page

Returns:integer number of open issues
getNumWatchers()[source]

gets the number of watchers from the json-main-page

Returns:number of watchers

getReadme()[source]

Gets the raw content of the readme of the repository, which can either be a README.md or README.rst file. The job of loading and exporting the readme is done by its Io-Agent.

Returns:string with the raw content
getRepoDescription()[source]

Gets the full description of the repository which is stored in the json-Api. If the description wasn’t set, an empty string “” will be returned

Returns:string which contains the description
getRepoLanguage()[source]

Gets the language from the main json-Api-page which was assigned by github to this repository. If no language was allocated, “undetected” will be returned

Returns:string which contains the language (e.g. C++, Java, Python,...)
getRepoLanguageAsVector()[source]

Returns an integer-list with 102 entries. All of them are set to 0 except the entry for the language which is used

Returns:list
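
A sketch of the assumed one-hot layout (repo denotes a GithubRepo instance); the index-to-language mapping is internal to the class:

    lstLangVector = repo.getRepoLanguageAsVector()
    assert len(lstLangVector) == 102
    # all entries are 0 except the one for the detected language
    iLangIndex = lstLangVector.index(max(lstLangVector))
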
getUser()[source]

getter method for user

Returns:self.user
getWordOccurences(lstVocab)[source]

calculates the number of occurrences of the words given by the vocab list; afterwards this list is divided by the word-length of the readme and multiplied by a factor

Parameters:lstVocab – vocabulary which is used in the CountVectorizer of scikit-learn
Returns:integer list representing the percentage-usage of the vocabulary words
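
A hedged sketch of the assumed computation: counts per vocabulary word, scaled by the readme word-length and a constant factor:

    lstVocab = ['driver', 'tutorial', 'dataset']    # vocabulary as passed to scikit-learn's CountVectorizer
    lstWordShares = repo.getWordOccurences(lstVocab)
    # one value per vocab word, proportional to count / readme word count
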
printFeatureOccurences(dicFoundWords)[source]

Prints out every feature with its number of occurrences

Parameters:
  • lstFeatureNames – list of the given features; these are the columns of the sparse-matrix
  • lstOccurrence – number of occurrences of the individual features (has the same size as lstFeatureNames)
  • iMinOccurence – minimum threshold to print out the feature (if set to 0 all features are printed out)
Returns:

readAttributes()[source]

reads all attributes of the json-file and fills the integer-attributes

Returns:
prototype.github_repo.auto_str(cls)[source]

Method for auto-generating a to-string function which prints out all member-attributes

Parameters:cls – current class
Returns:cls
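
A minimal sketch of applying the decorator; the exact string format produced by auto_str is an assumption:

    from prototype.github_repo import auto_str

    @auto_str
    class Dummy:
        def __init__(self):
            self.iValue = 42
            self.strName = 'demo'

    print(Dummy())  # assumed to print all member-attributes, e.g. iValue=42, strName=demo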

prototype.interface_repository_classifier module

class prototype.interface_repository_classifier.Interface_RepoClassifier[source]

Bases: object

exportModelToFile()[source]

abstract method. Exports the model and all prerequisites to the directory model/

Returns:
loadModelFromFile()[source]

abstract method. Loads the exported model

Returns:
loadTrainingData(strProjPathFileNameCSV)[source]

abstract method. The classifier loads the sample data from a given csv-file

Parameters:strProjPathFileNameCSV – path to the csv-file
Returns:
plotTheResult()[source]

abstract method. Creates a plot in which the classification is illustrated

Returns:
predictCategoryFromOwnerRepoName(strUser, strRepoName)[source]

abstract method. Predicts the category for a repository

Parameters:
  • strUser
  • strRepoName
Returns:

predictCategoryFromURL(strGitHubRepoURL)[source]

abstract method. Predicts the category for a repository given by URL

Parameters:strGitHubRepoURL
Returns:
predictResultsAndCompare(strProjPathFileNameCSV)[source]

abstract method. Predicts the categories for the repositories in a given csv-file and compares the result with the manual classification

Parameters:strProjPathFileNameCSV – path to the csv-file
Returns:
trainModel(lstTrainData, lstTrainLabels)[source]

abstract method. The model shall be trained via supervised learning

Parameters:
  • lstTrainData – matrix of the training data
  • lstTrainLabels – list of the associated labels
Returns:
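
A minimal skeleton of a concrete implementation; the method bodies are placeholders, not the behaviour of the shipped RepositoryClassifier:

    from prototype.interface_repository_classifier import Interface_RepoClassifier

    class DummyClassifier(Interface_RepoClassifier):
        def loadTrainingData(self, strProjPathFileNameCSV):
            pass  # read the sample data from the csv-file

        def trainModel(self, lstTrainData, lstTrainLabels):
            pass  # fit a supervised model on the training data

        def predictCategoryFromURL(self, strGitHubRepoURL):
            pass  # return the predicted category label

        # exportModelToFile, loadModelFromFile, plotTheResult,
        # predictCategoryFromOwnerRepoName and predictResultsAndCompare omitted for brevity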

prototype.repository_classifier module

class prototype.repository_classifier.RepositoryClassifier(bUseStringFeatures=True)[source]

Bases: prototype.interface_repository_classifier.Interface_RepoClassifier

exportModelToFile()[source]

exports the trained model and the mean values of the input variables to ‘./model/’. The export is done via joblib.dump() to a .pkl-file

Returns:
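
The export named above follows the standard joblib pattern; the sketch below assumes clf is the trained model and lstMeanValues the list of mean values, and uses the file names documented for loadModelFromFile:

    import joblib

    joblib.dump(clf, './model/RepositoryClassifier.pkl')
    joblib.dump(lstMeanValues, './model/lstMeanValues.pkl')
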
getLabelAlternative(lstFinalPercentages)[source]

gets the first alternative (the second result)

Parameters:lstFinalPercentages – percentage list for the single categories
Returns:integer label which describes the category
loadModelFromFile()[source]

loads / imports the model-object from ‘./model/RepositoryClassifier.pkl’ and the list of the mean values from ‘./model/lstMeanValues.pkl’

Returns:
loadTrainingData(strProjPathFileNameCSV='/data/csv/additional_data_sets_cleaned.csv', externalpath=None)[source]

trains the model with a given csv-file. The csv file must have 2 columns, URL and CATEGORY. The URL is given in the form ‘https://github.com/owner/repository-name’ and the CATEGORY is given by one of these options: ‘DEV’, ‘HW’, ‘EDU’, ‘DOCS’, ‘WEB’, ‘DATA’, ‘OTHER’

Parameters:strProjPathFileNameCSV – file path relative to the project-path where the csv-file is stored
Returns:self.lstTrainData (the scaled and normed data with which the model was trained), self.lstTrainLabels (the used training labels)
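
A hedged sketch of the call (repoClassifier denotes a RepositoryClassifier instance) and of the expected csv layout; the rows shown are purely illustrative:

    # URL,CATEGORY
    # https://github.com/owner/some-firmware,HW
    # https://github.com/owner/lecture-notes,EDU

    lstTrainData, lstTrainLabels = repoClassifier.loadTrainingData('/data/csv/additional_data_sets_cleaned.csv')
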
plotTheResult(lstTrainData, lstTrainLabels)[source]

this is currently empty -> see the plots in the GUI instead

Parameters:
  • lstTrainData – matrix which was used for training
  • lstTrainLabels – labels which were used for training
Returns:

predictCategoryFromGitHubRepoObj(tmpRepo)[source]

predicts the category for a GithubRepo-Object

Parameters:tmpRepo – GithubRepo-Object
Returns:iLabel, iLabelAlt, lstFinalPercentages, tmpRepo, lstNormedInputFeatures

predictCategoryFromOwnerRepoName(strUser, strRepoName)[source]

predicts the category for a repository which is given by the user and repo-name

Parameters:
  • strUser – owner of the repository
  • strRepoName – name of the repository
Returns:

predictCategoryFromURL(strGitHubRepoURL)[source]

loads the features of a given repository by URL and the model predicts its category-label

Parameters:strGitHubRepoURL – url to the repository
Returns:label value from 0 - 6, list of the percentages for the other categories
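
A minimal end-to-end sketch, assuming a previously exported model exists under ‘./model/’ and that the two documented return values are unpacked as shown:

    from prototype.repository_classifier import RepositoryClassifier

    repoClassifier = RepositoryClassifier(bUseStringFeatures=True)
    repoClassifier.loadModelFromFile()
    iLabel, lstFinalPercentages = repoClassifier.predictCategoryFromURL('https://github.com/octocat/Hello-World')
    # iLabel is an integer from 0 - 6, lstFinalPercentages holds the per-category percentages
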
predictProbaNearestCentroids(matCentroids, lstInputFeatures)[source]

because predictProba was missing in the default functionality for nearest-centroid, the probability is calculated via the distances to the different centroids

Parameters:
  • matCentroids – matrix of the centroids for each category
  • lstInputFeatures – full normed input feature list on which the prediction is based
Returns:
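
One plausible way to turn centroid distances into pseudo-probabilities, shown as a standalone numpy sketch; the actual weighting used by predictProbaNearestCentroids may differ:

    import numpy as np

    def distances_to_probabilities(matCentroids, lstInputFeatures):
        # Euclidean distance from the input point to every centroid
        dists = np.linalg.norm(np.asarray(matCentroids) - np.asarray(lstInputFeatures), axis=1)
        # closer centroids get larger weights; normalise so the weights sum to 1
        weights = 1.0 / (dists + 1e-12)
        return weights / weights.sum()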

predictResultsAndCompare(strProjPathFileNameCSV='/data/csv/manual_classification_appendix_b.csv')[source]

loads a csv-file with the layout ‘URL, CATEGORY, CATEGORY_ALTERNATIVE_1, CATEGORY_ALTERNATIVE_2’. The URL is given in the format ‘https://github.com/owner/repository-name’; the CATEGORY, CATEGORY_ALTERNATIVE_1 and CATEGORY_ALTERNATIVE_2 are given by one of these options: ‘DEV’, ‘HW’, ‘EDU’, ‘DOCS’, ‘WEB’, ‘DATA’, ‘OTHER’

After the prediction phase the result is compared with the given CATEGORY and CATEGORY_ALTERNATIVES. A verification matrix is created and the accuracy is calculated from 0.0 to 1.0

Parameters:strProjPathFileNameCSV – path relative to the project-path where the csv file is stored
Returns:the accuracy value (0.0 - 1.0)
trainModel(lstTrainData, lstTrainLabels)[source]

trains the model called self.clf with the given trainData and trainLabels

Parameters:
  • lstTrainData – list
  • lstTrainLabels
Returns:

Module contents