prototype package
Subpackages
Submodules
prototype.github_repo module
-
class
prototype.github_repo.
GithubRepo
(strUser, strName) Bases:
object
-
classmethod
fromURL
(strURL) Alternative constructor that takes a URL instead of user and name.
Parameters: strURL – URL of the GitHub repository
Returns: the result of calling the main constructor
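The alternative-constructor pattern described here can be sketched as follows. This is a minimal illustration, not the project's code: `RepoRef` is a hypothetical stand-in for GithubRepo, and the URL parsing via `urllib.parse` is an assumption about how user and name might be extracted.

```python
from urllib.parse import urlparse


class RepoRef:
    """Hypothetical stand-in for GithubRepo; only the constructor shape comes from the docs."""

    def __init__(self, strUser, strName):
        self.strUser = strUser
        self.strName = strName

    @classmethod
    def fromURL(cls, strURL):
        # Split 'https://github.com/owner/repository-name' into its path parts.
        lstParts = [p for p in urlparse(strURL).path.split("/") if p]
        if len(lstParts) < 2:
            raise ValueError("expected a URL of the form https://github.com/owner/name")
        # Delegate to the main constructor.
        return cls(lstParts[0], lstParts[1])


repo = RepoRef.fromURL("https://github.com/owner/repository-name")
print(repo.strUser, repo.strName)
```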
-
getDevTime
() Gets the development time of the repository in days. This is calculated as the difference between ‘created_at’ and ‘updated_at’.
Returns: integer number of days
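A minimal sketch of such a day-difference calculation, assuming the two values arrive as the ISO 8601 strings the GitHub API reports (for example ‘2016-01-01T00:00:00Z’). The function name `dev_time_in_days` is hypothetical.

```python
from datetime import datetime


def dev_time_in_days(strCreatedAt, strUpdatedAt):
    # Timestamp layout assumed to be 'YYYY-MM-DDTHH:MM:SSZ'.
    strFormat = "%Y-%m-%dT%H:%M:%SZ"
    dtCreated = datetime.strptime(strCreatedAt, strFormat)
    dtUpdated = datetime.strptime(strUpdatedAt, strFormat)
    # timedelta.days is the whole number of days between the two timestamps.
    return (dtUpdated - dtCreated).days


iDays = dev_time_in_days("2016-01-01T00:00:00Z", "2016-01-31T12:00:00Z")
print(iDays)
```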
-
getFeatureOccurences
(lstFeatureNames, lstOccurrence, iMinOccurence=1) Gets the found words with their number of occurrences in the form of a dictionary.
Parameters: - lstFeatureNames – vocab list
- lstOccurrence – list of number of occurrences
- iMinOccurence – minimum number of hits which are needed
Returns: dictionaryObject
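The pairing of vocabulary words with their counts could look roughly like this. It is a sketch under the assumption that the threshold is inclusive (a word with exactly iMinOccurence hits is kept); the function name is hypothetical.

```python
def feature_occurrences(lstFeatureNames, lstOccurrence, iMinOccurence=1):
    # Pair each vocabulary word with its count and keep those at or above the threshold.
    return {
        strName: iCount
        for strName, iCount in zip(lstFeatureNames, lstOccurrence)
        if iCount >= iMinOccurence
    }


dicFound = feature_occurrences(["api", "docs", "web"], [3, 0, 1])
print(dicFound)
```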
-
getFilteredReadme
(bApplyStemmer=True, bCheckStopWords=False) Returns the filtered readme with prepare_words() applied.
Returns: string of the filtered readme
-
getFilteredRepoDescription
(bApplyStemmer=True, bCheckStopWords=False) Gets a filtered version of the description of the repository. If the description wasn’t set, an empty string “” is returned.
Parameters: - bApplyStemmer – true if the words should be reduced to their stems
- bCheckStopWords – true if known stopwords such as (the, he, and, ...) should be ignored
Returns: string which contains the filtered form of the description
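To illustrate what the two flags do, here is a small standard-library sketch. It is not the project's prepare_words(): the stopword set is a tiny hypothetical sample, and `naive_stem` is a crude suffix stripper standing in for a real stemming algorithm.

```python
import re

SET_STOPWORDS = {"the", "he", "and", "a", "of", "to"}  # hypothetical, tiny sample


def naive_stem(strWord):
    # Crude suffix stripping; NOT a real stemming algorithm.
    for strSuffix in ("ing", "ed", "s"):
        if strWord.endswith(strSuffix) and len(strWord) > len(strSuffix) + 2:
            return strWord[: -len(strSuffix)]
    return strWord


def filter_text(strText, bApplyStemmer=True, bCheckStopWords=False):
    # An unset description yields an empty string, as documented.
    if not strText:
        return ""
    lstWords = re.findall(r"[a-z0-9]+", strText.lower())
    if bCheckStopWords:
        lstWords = [w for w in lstWords if w not in SET_STOPWORDS]
    if bApplyStemmer:
        lstWords = [naive_stem(w) for w in lstWords]
    return " ".join(lstWords)


print(filter_text("The parser and the tests", bCheckStopWords=True))
```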
-
getNormedFeatures
(lstMeanValues) Returns the features normed by dividing them by the mean values.
Parameters: lstMeanValues – mean value of every integer feature
Returns: list of the normed integer features
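Dividing each feature by its mean can be sketched as below. The function name is hypothetical, and leaving a feature unscaled when its mean is zero is an assumption; the documentation does not say how that case is handled.

```python
def normed_features(lstFeatures, lstMeanValues):
    # Divide each feature by its mean; a zero mean leaves the value unscaled (assumption).
    return [f / m if m != 0 else f for f, m in zip(lstFeatures, lstMeanValues)]


lstNormed = normed_features([10, 4, 3], [5, 8, 0])
print(lstNormed)
```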
-
getReadme
() Gets the raw content of the repository’s readme, which can be either a README.md or a README.rst file. Loading and exporting the readme is handled by its Io-Agent.
Returns: string with the raw content
-
getRepoDescription
() Gets the full description of the repository, which is stored in the JSON API. If the description wasn’t set, an empty string “” is returned.
Returns: string which contains the description
-
getRepoLanguage
() Gets the language that GitHub assigned to this repository, taken from the main JSON API page. If no language was assigned, “undetected” is returned.
Returns: string which contains the language (e.g. C++, Java, Python, ...)
-
getRepoLanguageAsVector
() Returns an integer list with 102 entries. All of them are set to 0 except the entry for the language that is used.
Returns: list
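This is a one-hot encoding. A minimal sketch follows, using a shortened hypothetical language list in place of the 102 entries, whose contents and order are not given here.

```python
LST_LANGUAGES = ["C++", "Java", "Python", "undetected"]  # hypothetical, shortened list


def language_as_vector(strLanguage, lstLanguages=LST_LANGUAGES):
    # Start with all zeros and set the slot of the matching language to 1.
    lstVector = [0] * len(lstLanguages)
    if strLanguage in lstLanguages:
        lstVector[lstLanguages.index(strLanguage)] = 1
    return lstVector


print(language_as_vector("Java"))
```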
-
getWordOccurences
(lstVocab) Calculates the number of occurrences of the words given by the vocab list; this list is then divided by the number of words in the readme and multiplied by a factor.
Parameters: lstVocab – vocabulary which is used in the CountVectorizer of scikit-learn
Returns: integer list representing the percentage usage of the vocabulary words
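The calculation can be sketched with plain counting. Note the swap: this uses `collections.Counter` on whitespace-split words rather than scikit-learn's CountVectorizer, and the factor of 100 as well as the function name are assumptions.

```python
from collections import Counter


def word_occurrences(strReadme, lstVocab, fFactor=100.0):
    lstWords = strReadme.lower().split()
    if not lstWords:
        # An empty readme has no usage at all.
        return [0.0] * len(lstVocab)
    dicCounts = Counter(lstWords)
    # Count per vocabulary word, relative to the readme length, scaled by the factor.
    return [dicCounts[strWord] / len(lstWords) * fFactor for strWord in lstVocab]


lstUsage = word_occurrences("api docs api web", ["api", "docs", "missing"])
print(lstUsage)
```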
-
printFeatureOccurences
(dicFoundWords) Prints out every feature with its number of occurrences.
Parameters: - lstFeatureNames – list of the given features; these are the columns of the sparse matrix
- lstOccurrence – number of occurrences of the individual features (has the same size as lstFeatureNames)
- iMinOccurence – minimum threshold to print out the feature (if set to 0, all features are printed)
Returns:
-
prototype.interface_repository_classifier module
-
class
prototype.interface_repository_classifier.
Interface_RepoClassifier
Bases:
object
-
exportModelToFile
() Abstract method. Exports the model and all prerequisites to the directory model/.
Returns:
-
loadTrainingData
(strProjPathFileNameCSV) Abstract method. The classifier loads the sample data from a given csv-file.
Parameters: strProjPathFileNameCSV – path to the csv-file
Returns:
-
predictCategoryFromOwnerRepoName
(strUser, strRepoName) Abstract method. Predicts the category for a repository.
Parameters: - strUser –
- strRepoName –
Returns:
-
predictCategoryFromURL
(strGitHubRepoURL) Abstract method. Predicts the category.
Parameters: strGitHubRepoURL –
Returns:
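One common way to express such an interface in Python is the `abc` module. The sketch below mirrors the four documented method names; whether the project itself uses `abc` is not stated here, and `InterfaceSketch`, `DummyClassifier` and the placeholder label 0 are hypothetical.

```python
from abc import ABC, abstractmethod


class InterfaceSketch(ABC):
    """Hypothetical mirror of Interface_RepoClassifier using the abc module."""

    @abstractmethod
    def loadTrainingData(self, strProjPathFileNameCSV):
        ...

    @abstractmethod
    def exportModelToFile(self):
        ...

    @abstractmethod
    def predictCategoryFromOwnerRepoName(self, strUser, strRepoName):
        ...

    @abstractmethod
    def predictCategoryFromURL(self, strGitHubRepoURL):
        ...


class DummyClassifier(InterfaceSketch):
    def loadTrainingData(self, strProjPathFileNameCSV):
        return None

    def exportModelToFile(self):
        return None

    def predictCategoryFromOwnerRepoName(self, strUser, strRepoName):
        return 0  # placeholder label, not a real prediction

    def predictCategoryFromURL(self, strGitHubRepoURL):
        # Take the last two path segments as owner and repository name.
        strUser, strRepoName = strGitHubRepoURL.rstrip("/").split("/")[-2:]
        return self.predictCategoryFromOwnerRepoName(strUser, strRepoName)


print(DummyClassifier().predictCategoryFromURL("https://github.com/owner/repository-name"))
```

With this layout, instantiating the interface directly, or a subclass that leaves a method unimplemented, raises a TypeError.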
-
prototype.repository_classifier module
-
class
prototype.repository_classifier.
RepositoryClassifier
(bUseStringFeatures=True) Bases:
prototype.interface_repository_classifier.Interface_RepoClassifier
-
exportModelToFile
() Exports the trained model and the mean values of the input variables to ‘./model/’. The export is done via joblib.dump() to a .pkl file.
Returns:
-
getLabelAlternative
(lstFinalPercentages) Gets the first alternative (the second result).
Parameters: lstFinalPercentages – list of percentages for the individual categories
Returns: integer label which describes the category
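Picking the runner-up category can be sketched as follows, assuming the label is simply the index into the percentage list. The function name is hypothetical.

```python
def label_alternative(lstFinalPercentages):
    # Sort the category indices by descending percentage and take the runner-up.
    lstOrder = sorted(
        range(len(lstFinalPercentages)),
        key=lambda i: lstFinalPercentages[i],
        reverse=True,
    )
    return lstOrder[1]


iLabelAlt = label_alternative([0.1, 0.5, 0.3, 0.1])
print(iLabelAlt)
```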
-
loadModelFromFile
() Loads the model object from ‘./model/RepositoryClassifier.pkl’ and the list of the mean values from ‘./model/lstMeanValues.pkl’.
Returns:
-
loadTrainingData
(strProjPathFileNameCSV='/data/csv/additional_data_sets_cleaned.csv', externalpath=None) Trains the model with a given csv-file. The csv file must have 2 columns, URL and CATEGORY. The URL is given in the form ‘https://github.com/owner/repository-name’. The CATEGORY is one of ‘DEV’, ‘HW’, ‘EDU’, ‘DOCS’, ‘WEB’, ‘DATA’, ‘OTHER’.
Parameters: strProjPathFileNameCSV – file path relative to the project path where the csv-file is stored
Returns: self.lstTrainData (the scaled and normed data with which the model was trained), self.lstTrainLabels (the used training labels)
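A file in the described layout could be read and validated like this. The sketch assumes a comma delimiter and a header row naming the two columns; `read_training_rows` is a hypothetical helper, and it only parses the rows, it does not train anything.

```python
import csv
import io

SET_CATEGORIES = {"DEV", "HW", "EDU", "DOCS", "WEB", "DATA", "OTHER"}


def read_training_rows(fileObj):
    lstRows = []
    for dicRow in csv.DictReader(fileObj):
        strCategory = dicRow["CATEGORY"].strip()
        # Reject anything outside the seven documented categories.
        if strCategory not in SET_CATEGORIES:
            raise ValueError("unknown category: " + strCategory)
        lstRows.append((dicRow["URL"].strip(), strCategory))
    return lstRows


strSample = "URL,CATEGORY\nhttps://github.com/owner/repository-name,DEV\n"
lstRows = read_training_rows(io.StringIO(strSample))
print(lstRows)
```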
-
plotTheResult
(lstTrainData, lstTrainLabels) This is currently empty; see the plots in the GUI instead.
Parameters: - lstTrainData – matrix which was used for training
- lstTrainLabels – labels which were used for training
Returns:
-
predictCategoryFromGitHubRepoObj
(tmpRepo) Predicts the category for a GithubRepo object.
Parameters: tmpRepo – GithubRepo object
Returns: iLabel, iLabelAlt, lstFinalPercentages, tmpRepo, lstNormedInputFeatures
-
predictCategoryFromOwnerRepoName
(strUser, strRepoName) Predicts the category for a repository given by its owner and repository name.
Parameters: - strUser – owner of the repository
- strRepoName – name of the repository
Returns:
-
predictCategoryFromURL
(strGitHubRepoURL) Loads the features of a given repository by URL; the model then predicts its category label.
Parameters: strGitHubRepoURL – URL of the repository
Returns: label value from 0 to 6, list of the percentages for the other categories
-
predictProbaNearestCentroids
(matCentroids, lstInputFeatures) Because predictProba is missing from the default nearest-centroid functionality, the probability is calculated from the distances to the different centroids.
Parameters: - matCentroids – matrix of the centroids for each category
- lstInputFeatures – full normed input feature list on which the prediction is based
Returns:
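The exact formula is not given here. One plausible way to turn centroid distances into probabilities is inverse-distance weighting, sketched below with Euclidean distance; both choices, and the function name, are assumptions rather than the project's method.

```python
import math


def proba_from_centroids(matCentroids, lstInputFeatures):
    # Euclidean distance from the input to every category centroid.
    lstDistances = [math.dist(lstCentroid, lstInputFeatures) for lstCentroid in matCentroids]
    lstIsZero = [fDist == 0 for fDist in lstDistances]
    if any(lstIsZero):
        # The input sits exactly on one or more centroids: share the mass among them.
        iCount = sum(lstIsZero)
        return [1.0 / iCount if bZero else 0.0 for bZero in lstIsZero]
    # Closer centroids get larger weights; normalise so the values sum to 1.
    lstWeights = [1.0 / fDist for fDist in lstDistances]
    fTotal = sum(lstWeights)
    return [fWeight / fTotal for fWeight in lstWeights]


lstProba = proba_from_centroids([[0.0, 0.0], [4.0, 0.0]], [1.0, 0.0])
print(lstProba)
```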
-
predictResultsAndCompare
(strProjPathFileNameCSV='/data/csv/manual_classification_appendix_b.csv') Loads a csv-file with the layout ‘URL, CATEGORY, CATEGORY_ALTERNATIVE_1, CATEGORY_ALTERNATIVE_2’. The URL is given in the format ‘https://github.com/owner/repository-name’. CATEGORY, CATEGORY_ALTERNATIVE_1 and CATEGORY_ALTERNATIVE_2 are each one of ‘DEV’, ‘HW’, ‘EDU’, ‘DOCS’, ‘WEB’, ‘DATA’, ‘OTHER’. After the prediction phase the result is compared with the given CATEGORY and CATEGORY_ALTERNATIVES. A verification matrix is created and the accuracy is calculated, from 0.0 to 1.0.
Parameters: strProjPathFileNameCSV – path relative to the project path where the csv file is stored
Returns: the accuracy value (0.0 - 1.0)
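The accuracy figure can be sketched as a hit rate. This assumes a prediction counts as correct when it matches the CATEGORY or either alternative; how the project actually weighs the alternatives is not stated here, and the function name is hypothetical.

```python
def accuracy_with_alternatives(lstPredicted, lstAccepted):
    # lstAccepted holds, per repository, the set of categories that count as a hit.
    if not lstPredicted:
        return 0.0
    iHits = sum(
        1 for strPred, setOk in zip(lstPredicted, lstAccepted) if strPred in setOk
    )
    return iHits / len(lstPredicted)


fAccuracy = accuracy_with_alternatives(
    ["DEV", "WEB", "DOCS", "HW"],
    [{"DEV"}, {"DEV", "WEB"}, {"EDU"}, {"HW", "OTHER"}],
)
print(fAccuracy)
```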
-