pici
community
Community
Bases: ABC
Abstract community class.
__eq__(other)
Two communities are the same if they have the same name. This is used to simplify caching.
_generate_temporal_graph(start=None, end=None, kind='co_contributor')
Generate a graph based only on posts created after start (>) and before end (<).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
start | datetime | | None |
end | datetime | | None |
kind | str | One of 'co_contributor', 'commenter'. | 'co_contributor' |
temporal_graph(start=None, end=None, kind='co_contributor')
Cached access to temporal graphs.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
start | None or datetime | Start of graph snapshot. | None |
end | None or datetime | End of graph snapshot. | None |
kind | str | 'co_contributor' or 'commenter'. | 'co_contributor' |
datatypes
CommunityDataLevel
Bases: Enum
View on community.
TODO
Document properly
MetricReturnType
Bases: Enum
Category of representation of metrics' return type.
DATAFRAME = 'dataframe'
class-attribute
Metric's return values as series in a Pandas.DataFrame.
PLAIN = 'plain'
class-attribute
Metric's original return value (not modified).
TABLE = 'table'
class-attribute
Metric's return value(s) in table format.
helpers
aggregate(dict_of_series, aggregations=[np.mean, np.min, np.max, np.std, np.sum])
Applies a number of aggregations to the series supplied as values in dict_of_series. Keys are the names of the series; the name of each aggregation is appended to the series name as "(agg-name)".
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dict_of_series | dict of indicator_name:Pandas.Series | | required |
aggregations | list of aggregation functions | | [np.mean, np.min, np.max, np.std, np.sum] |
Returns:
Type | Description |
---|---|
dict | Formatted indicator_name: aggregated series. |
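A minimal sketch of how such an aggregation helper might work (illustrative only; the actual implementation may differ):

```python
import numpy as np
import pandas as pd

def aggregate(dict_of_series, aggregations=(np.mean, np.min, np.max, np.std, np.sum)):
    """Apply each aggregation to every series; the aggregation's name is
    appended to the series name as "(agg-name)"."""
    results = {}
    for name, series in dict_of_series.items():
        for agg in aggregations:
            results[f"{name} ({agg.__name__})"] = agg(series)
    return results

agg = aggregate({"posts per user": pd.Series([1, 2, 3])})
```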
apply_to_initial_posts(community, new_cols, func)
Applies func to initial posts (community.posts where post_position_in_thread==1). Returns a DataFrame with the topic_column field as index. Columns in the returned DataFrame are named according to the strings in new_cols; column values follow the order of the values returned by func.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
community | pici.Community | | required |
new_cols | list of str | | required |
func | function | Function to apply to each initial post from community.posts. | required |
The returned DataFrame is indexed by thread ids.
as_table(func)
Decorator that returns results as table, indexed with community name. TODO: document
Parameters:
Name | Type | Description | Default |
---|---|---|---|
func | | | required |
create_co_contributor_graph(link_data, node_data, node_col, group_col, node_attributes, connected=True)
Creates a networkx.Graph with nodes=users and edges if two users have contributed to the same thread. Edge weights = number of threads where two users co-contributed.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
link_data | | | required |
node_data | | | required |
node_col | | | required |
group_col | | | required |
node_attributes | | | required |
connected | | | True |
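The edge-weighting logic can be sketched without the graph library (input shape and names here are hypothetical; the real helper returns a networkx.Graph):

```python
from itertools import combinations
from collections import Counter

def co_contribution_weights(link_data):
    """link_data: iterable of (thread_id, contributor) pairs.
    Returns a Counter mapping each unordered user pair to the number of
    threads in which both users contributed (the edge weight)."""
    threads = {}
    for thread_id, user in link_data:
        threads.setdefault(thread_id, set()).add(user)
    weights = Counter()
    for users in threads.values():
        for a, b in combinations(sorted(users), 2):
            weights[(a, b)] += 1
    return weights

w = co_contribution_weights([
    ("t1", "ann"), ("t1", "bob"),
    ("t2", "ann"), ("t2", "bob"), ("t2", "cat"),
])
```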
create_commenter_graph(link_data, node_data, node_col, group_col, node_attributes, conntected=True)
Creates a networkx.DiGraph with nodes=users and directed edges a->b if a has replied to an initial post by b. Edge weight is the number of comments.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
link_data | | | required |
node_data | | | required |
node_col | | | required |
group_col | | | required |
node_attributes | | | required |
conntected | | | True |
flat(df, columns='community_name')
Returns a pivoted version of df with flattened index.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | pd.DataFrame | Pandas.DataFrame | required |
columns | str | Column name to pivot on. | 'community_name' |
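Assuming flat() amounts to a pandas pivot followed by flattening the resulting column MultiIndex, a sketch (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "community_name": ["OEM", "OSM"],
    "number of posts": [10, 20],
})
# Pivot on community_name so each community becomes its own column...
pivoted = df.pivot(columns="community_name")
# ...then flatten the resulting (value, community) MultiIndex columns
# into plain string labels.
pivoted.columns = [" ".join(map(str, c)) for c in pivoted.columns.to_flat_index()]
```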
generate_indicator_results(posts, initial_post, feedback, indicator_text, column, aggs=[np.sum, np.mean, np.min, np.max, np.std])
Returns results from column in the DataFrames posts, initial_post, and feedback as different aggregations (sum, mean, ...). The initial post is only aggregated as sum. Output is a dict of df/agg: value, e.g. "posts indicator_text (mean)": value.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
posts | | | required |
initial_post | | | required |
feedback | | | required |
indicator_text | | | required |
column | | | required |
join_df(func)
Decorator that joins results to existing dataframe in community. TODO: document
Parameters:
Name | Type | Description | Default |
---|---|---|---|
func | | | required |
merge_dfs(dfs, only_unique=False)
Wrapper for Pandas.merge(). Merges DataFrames. TODO: document
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dfs | Iterable[pd.DataFrame] | | required |
only_unique | bool | | False |
num_words(text)
Counts the number of words in a text. HTML tags and comments are detected and excluded from the count.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | Text to count words in. | required |
Returns:
Name | Type | Description |
---|---|---|
count | int | Number of words. |
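A sketch of such a counter, assuming HTML is stripped with regular expressions (the actual implementation may use a proper HTML parser):

```python
import re

def num_words(text):
    """Count words, excluding HTML tags and comments from the count."""
    text = re.sub(r"<!--.*?-->", " ", text, flags=re.DOTALL)  # drop HTML comments
    text = re.sub(r"<[^>]+>", " ", text)                      # drop HTML tags
    return len(text.split())

count = num_words("<p>Hello <!-- ignored --> <b>world</b></p>")
```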
series_most_common(series)
Get most common element from Pandas.Series.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
series | pd.Series | Pandas.Series | required |
where_all(conditions)
Concatenates logical conditions with and.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
conditions | | | required |
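Assuming the conditions are boolean pandas Series, the concatenation can be sketched with functools.reduce:

```python
from functools import reduce
import operator
import pandas as pd

def where_all(conditions):
    """AND-combine an iterable of boolean Series into a single mask."""
    return reduce(operator.and_, conditions)

s = pd.Series([1, 2, 3, 4])
mask = where_all([s > 1, s < 4])  # True only where both conditions hold
```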
word_occurrences(text, words)
Counts the number of occurrences of specified words
in text
.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | A text with words. | required |
words | list of str | Words. | required |
Returns:
Name | Type | Description |
---|---|---|
occurrences | dict of str:int | |
labelling
InnovationLabels
Bases: Labels
TODO: add documentation
from_limesurvey(limesurvey_results, drop_labellers=None)
Adds label entries from Limesurvey results format. Limesurvey results can contain multiple labelled threads per response. For each thread i and associated url and labels, the data must contain one column, e.g.: "thread1" (=url), "labelA1", "labelB1", ..., "thread2", "labelA2", "labelB2", ...
Parameters:
Name | Type | Description | Default |
---|---|---|---|
limesurvey_results | str (path to file) or Pandas.DataFrame | | required |
LabelCollection
TODO: add documentation
all_label_names()
property
TODO: add documentation
by_level(level)
TODO: Add documentation
Parameters:
Name | Type | Description | Default |
---|---|---|---|
level | | | required |
labels()
property
TODO: add documentation
LabelStats
This class provides metrics and visualizations to analyze the annotations made by the labellers.
Available metrics are
- % agreement ("a_0")
- Cohen's kappa (two labellers)
- Fleiss' kappa (multiple labellers)
- Krippendorff's alpha (multiple labellers, missing data)
Labellers' annotations can furthermore be evaluated against a subsample of "goldstandard" annotations, allowing labellers to be associated with a quality score.
TODO refactor --> move visualizations to visualizations.py ?
See [1] for a comparison of inter-rater agreement metrics.
[1] Xinshu Zhao, Jun S. Liu & Ke Deng (2013) Assumptions behind Intercoder Reliability Indices, Annals of the International Communication Association, 36:1, 419-480, DOI: 10.1080/23808985.2013.11679142
_melt_goldstandard_agreement(data)
Prepares interrater_agreement
dataframe for plotting (e.g., with
seaborn): wide format to long format, shorten label names.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | Pandas.DataFrame | Label names as index, "labellers"- | required |
Returns:
Type | Description |
---|---|
Pandas.DataFrame with columns: index: shortened label names; labellers: labeller names; variable: metric name; value: metric value. |
cohen_kappa()
Get Cohen's kappa for all labels, using scikit-learn implementation
sklearn.metrics.cohen_kappa_score
.
Returns NaN if number of labellers != 2.
Returns:
Type | Description |
---|---|
dict | (label name, kappa) pairs. |
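The library defers to sklearn.metrics.cohen_kappa_score; for illustration, the underlying formula (observed vs. chance agreement) can be computed directly:

```python
from collections import Counter

def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters: (p_o - p_e) / (1 - p_e)."""
    n = len(ratings_a)
    # Observed agreement: fraction of cases where both raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected (chance) agreement from each rater's marginal frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

kappa = cohen_kappa([1, 0, 1, 1], [1, 0, 0, 1])
```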
complete_agreement()
Get percentage of cases where all labellers agree (per label).
Returns:
Name | Type | Description |
---|---|---|
agreement | Pandas.DataFrame | Columns: 'label', '% perfect agreement', '% n'. |
fleiss_kappa()
Get Fleiss kappa for all labels, based on Statsmodels
implementation (statsmodels.stats.inter_rater.fleiss_kappa
).
Returns NaN if number of labellers < 2.
Returns:
Type | Description |
---|---|
dict | (label name, kappa) pairs. |
interrater_agreement()
Calculates the overall interrater agreement for all labellers in data. If the number of labellers > 2, all values for Cohen's/Fleiss' kappa will be NaN.
Returns:
Type | Description |
---|---|
agreement dataframe |
krippendorff_alpha()
Get Krippendorff's alpha using the krippendorff package.
See also: Andrew F. Hayes & Klaus Krippendorff (2007) Answering the Call for a Standard Reliability Measure for Coding Data, Communication Methods and Measures, 1:1, 77-89, DOI: 10.1080/19312450709336664
Returns:
Type | Description |
---|---|
Pandas.DataFrame |
pairwise_interrater_agreement(goldstandard=None, min_comparisons=1)
Calculates the agreement metrics for all combinations of two labellers. If goldstandard is set (name of labeller), only comparisons with the goldstandard are calculated.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
goldstandard | | Name of labeller. | None |
min_comparisons | | Minimum number of shared labelled cases. | 1 |
Returns:
Type | Description |
---|---|
Agreement dataframe with a "labellers" column. |
plot_goldstandard_agreement(kind='label_boxplots', goldstandard=None, data=None)
Plot the labellers' agreement with the goldstandard. Provides different plots through kind:
- label_boxplots: values: each labeller's agreement with goldstandard; x: metric, y: boxplot of values
- labellers_points: values: each labeller's agreement with goldstandard; grid with one column per metric; x: labels, y: values
Parameters:
Name | Type | Description | Default |
---|---|---|---|
kind | | One of {'label_boxplots', 'labellers_points'}. | 'label_boxplots' |
goldstandard | | Name of labeller to use as goldstandard (used if data is None; generates data). | None |
data | | Agreement dataframe generated by | None |
Labels
Bases: ABC
TODO: add documentation
__init__(data=None, cols=DEFAULT_COLS, filter=None)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | pd.DataFrame | | None |
cols | dict | | DEFAULT_COLS |
filter | | | None |
append(data, cols=DEFAULT_COLS, drop_labellers=None)
TODO: add documentation
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | pd.DataFrame | | required |
cols | dict | | DEFAULT_COLS |
drop_labellers | | | None |
data_by_label(format='sklearn', dropna=False)
TODO: add documentation
Parameters:
Name | Type | Description | Default |
---|---|---|---|
format | | | 'sklearn' |
dropna | | | False |
labellers()
property
TODO: add documentation
rating_table(label_name, communities=None, custom_filter=None, allow_missing_data=False)
Get the rating table for one label to be used, e.g.,
with statsmodels.stats.inter_rater
.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
label_name | | Label to be returned in table. | required |
communities | | List of communities to include in table, | None |
custom_filter | | | None |
allow_missing_data | | Whether to drop columns with missing ratings. | False |
Returns:
Type | Description |
---|---|
Rating table: labels as a 2-dim table with raters (labellers) in rows and ratings in columns. |
set_filter(f)
TODO: add documentation
Parameters:
Name | Type | Description | Default |
---|---|---|---|
f | | | required |
metrics
basic
Basic metrics based on counts, dates etc. of posts, contributors.
By level of observation / concept:
topics
community
agg_number_of_posts_per_interval(community, interval)
Number of posts per interval
.
Total number of posts in community per interval
(parameter).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
community | pici.Community | | required |
interval | str | The interval over which to aggregate. See | required |
Returns:
Name | Type | Description |
---|---|---|
results | dict of str:int | |
agg_posts_per_topic(community)
Min, max, and average number of posts authored per topic.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
community | | | required |
Returns:
Name | Type | Description |
---|---|---|
results | dict of str:int | |
contributors_per_interval(community, interval)
Number of users that have authored at least one post in time interval.
TODO
- document
- add to TOC
Parameters:
Name | Type | Description | Default |
---|---|---|---|
community | | | required |
interval | | | required |
lorenz(community)
Distribution of posts (in analogy to lorenz curve). Returns (x,y) where x is the (least-contributing) bottom x% of users, and y the proportion of posts made by them.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
community | | | required |
Returns:
Name | Type | Description |
---|---|---|
report | | % contributors, % posts |
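The (x, y) construction can be sketched from per-user post counts (posts_per_user is a hypothetical input; the real metric derives it from community.posts):

```python
import numpy as np

def lorenz(posts_per_user):
    """(x, y): bottom x% of (least-contributing) users vs. the
    proportion of posts they authored."""
    counts = np.sort(np.asarray(posts_per_user))  # ascending: least first
    y = np.cumsum(counts) / counts.sum()          # cumulative share of posts
    x = np.arange(1, len(counts) + 1) / len(counts)  # cumulative share of users
    return x, y

x, y = lorenz([1, 1, 2, 6])
```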
number_of_contributors_per_topic(community)
Number of different contributors that have authored at least one post in a thread.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
community | pici.Community | | required |
Returns:
Name | Type | Description |
---|---|---|
results | dict of str:int | |
number_of_posts(community)
Total number of posts authored by community.
TODO
document
Parameters:
Name | Type | Description | Default |
---|---|---|---|
community | | | required |
number_of_posts_per_topic(community)
Number of posts per topic.
TODO
- add to toc
Parameters:
Name | Type | Description | Default |
---|---|---|---|
community | | | required |
Returns:
Name | Type | Description |
---|---|---|
report | | |
number_of_words(community)
The number of words in a post (removing html).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
community | pici.Community | | required |
Returns:
Name | Type | Description |
---|---|---|
results | dict of str:int | |
post_dates_per_topic(community)
Date of first post, second post, and last post.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
community | pici.Community | | required |
Returns:
Name | Type | Description |
---|---|---|
results | dict of str:date | |
post_delays_per_topic(community)
Delays (in days) between first and second post, and first and last post.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
community | pici.Community | | required |
Returns:
Name | Type | Description |
---|---|---|
results | dict of str:int | |
posts_per_interval(community, interval)
Number of posts authored by community per time interval.
TODO
- document
- add to TOC
Parameters:
Name | Type | Description | Default |
---|---|---|---|
community | | | required |
interval | | | required |
posts_word_occurrence(community, words, normalize=True)
Counts the occurrence of a set of words in each post.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
community | pici.Community | | required |
words | list of str | List of words to count in post texts. | required |
normalize | bool | Normalize occurrence count by text length. | True |
Returns:
Name | Type | Description |
---|---|---|
results | dict of str:int | |
cached_metrics
This is a collection of all cachable functions that are used in the calculation of indicators. The cache is implemented using functools.lru_cache with maxsize=None. Caching is commonly done at least on community level (pici.Community is hashable). Examples of when using a cache makes sense:
- calculating the similarity of post texts (done once for all combinations)
- generating "temporal networks" (filtered representations of networks, depending on dates of posts)
It is recommended to define cached parts of indicators here.
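The pattern can be illustrated with a minimal hashable Community stand-in (hypothetical; only name-based equality matters, as in Community.__eq__ above):

```python
from functools import lru_cache

class Community:
    """Minimal stand-in: equality and hash are based on the name only."""
    def __init__(self, name):
        self.name = name
    def __eq__(self, other):
        return self.name == other.name
    def __hash__(self):
        return hash(self.name)

calls = []

@lru_cache(maxsize=None)
def expensive_metric(community):
    calls.append(community.name)  # record actual (non-cached) computations
    return len(community.name)

expensive_metric(Community("OEM"))
expensive_metric(Community("OEM"))  # equal hash/equality: served from cache
```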
_comments_by_contributor(community, contributor, date_limit=None)
Get all comments made by contributor.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
community | | | required |
contributor | | User name. | required |
date_limit | | Date in string format, e.g. '2020-01-15'. | None |
Returns comments by the specified user (before the specified date_limit).
_contribution_regularity(community, contributor, start, end)
Get the contribution regularity of contributor as the percentage of days that contributor posted in the forum, between the dates start and end.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
community | | | required |
contributor | | | required |
start | | | required |
end | | | required |
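A sketch of the regularity computation, assuming post dates come as a list-like of timestamps (names are illustrative):

```python
import pandas as pd

def contribution_regularity(post_dates, start, end):
    """Share of days in [start, end] on which at least one post was made."""
    dates = pd.to_datetime(pd.Series(post_dates))
    dates = dates[(dates >= start) & (dates <= end)]
    days_posted = dates.dt.normalize().nunique()       # distinct active days
    total_days = (pd.Timestamp(end) - pd.Timestamp(start)).days + 1
    return days_posted / total_days

reg = contribution_regularity(
    ["2020-01-01 10:00", "2020-01-01 18:00", "2020-01-03 09:00"],
    "2020-01-01", "2020-01-04",
)
```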
_initial_post_author_network_metric(initial_post, community, metric, kind)
Get a cached network metric for the author of an initial post.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
initial_post | | | required |
community | | | required |
metric | | | required |
thread_date | | | required |
kind | | | required |
Returns:
Type | Description |
---|---|
The value of the metric. |
_replies_to_own_topics(community, contributor, date_limit=None)
The number of replies made to initial posts by specified contributor in community.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
community | | | required |
contributor | | | required |
date_limit | | Date in string format, e.g. '2020-01-15'. | None |
If date_limit is provided, only threads & replies posted before the date limit are considered.
_temporal_text_similarity_dict(community, date, text_col='preprocessed_text__words_no_stop', similarity_metric='token_sort_ratio')
Returns a dictionary of post-text:1xn-similarity-matrix for similarity subgraph filtered by date.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
community | | | required |
date | | | required |
text_col | | | 'preprocessed_text__words_no_stop' |
similarity_metric | | | 'token_sort_ratio' |
_temporal_text_similarity_network(community, date, text_col='preprocessed_text__words_no_stop', similarity_metric='token_sort_ratio', only_initial_posts=True)
Create a subview graph of the text similarity network created by _text_similarity_network() by filtering out all nodes (=posts) where post.date is > date.
_text_similarity_network(community, text_col='preprocessed_text__words_no_stop', similarity_metric='token_sort_ratio', only_initial_posts=True)
Create a text-similarity network for all posts in community, using
textacy.representations.network.build_similarity_network()
.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
community | | | required |
text_col | | | 'preprocessed_text__words_no_stop' |
similarity_metric | | | 'token_sort_ratio' |
only_initial_posts | | | True |
_threads_by_contributor(community, contributor, date_limit=None)
Get all threads initiated by contributor.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
community | | | required |
contributor | | User name. | required |
date_limit | | Date in string format, e.g. '2020-01-15'. | None |
Returns threads initiated by the specified user (before the specified date_limit).
distinctiveness
initial_post_text_distance(community, similarity_metric='token_sort_ratio')
Calculates the text distance of initial posts to previously authored initial posts as a measure of distinctiveness.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
community | | | required |
similarity_metric | | | 'token_sort_ratio' |
agg_method | | | required |
elaboration
basic_text_based_elaboration(community, col_n_words='preprocessed_text__n_words', col_n_words_no_stop='preprocessed_text__n_words_no_stop', col_syllables='preprocessed_text__n_syllables', col_avg_syllables='preprocessed_text__avg_syllables_per_word', col_smog_index='preprocessed_text__smog_index', col_auto_readability='preprocessed_text__automated_readability_index', col_coleman_liau='preprocessed_text__coleman_liau_index', col_flesch_kincaid='preprocessed_text__flesch_kincaid_grade_level', col_frac_uppercase='preprocessed_text__frac_uppercase', col_frac_punctuation_marks='preprocessed_text__frac_punctuation_marks')
Provides basic text-based elaboration indicators, such as number of words, number of syllables, and different readability scores.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
community | | | required |
col_n_words | | Column name in community.posts to use for word count. | 'preprocessed_text__n_words' |
col_n_words_no_stop | | | 'preprocessed_text__n_words_no_stop' |
col_syllables | | Column name in community.posts to use for mean number | 'preprocessed_text__n_syllables' |
col_avg_syllables | | | 'preprocessed_text__avg_syllables_per_word' |
col_smog_index | | | 'preprocessed_text__smog_index' |
col_auto_readability | | | 'preprocessed_text__automated_readability_index' |
col_coleman_liau | | | 'preprocessed_text__coleman_liau_index' |
col_flesch_kincaid | | | 'preprocessed_text__flesch_kincaid_grade_level' |
col_frac_uppercase | | | 'preprocessed_text__frac_uppercase' |
col_frac_punctuation_marks | | | 'preprocessed_text__frac_punctuation_marks' |
experience
initiator_experience_by_commenter_network_out_deg_centrality(community)
Determines a thread initiator's 'experience' by their out-degree centrality in the commenter network at the time of thread creation, i.e., the number of users the initiator has 'commented on' (has replied to in a user's thread).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
community | | | required |
initiator_experience_by_past_contributions(community, ignore_temporal_dependency=True, use_rounded_date=False)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
community | | | required |
ignore_temporal_dependency | | | True |
use_rounded_date | | | False |
helpfulness
initiator_helpfulness_by_contribution_regularity(community, lookback_days=100)
Calculates the contribution regularity of the initiator of each thread. Contribution regularity is the percentage of past days in which the initiator posted in the forum. Past days is limited by the lookback_days parameter.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
community | | | required |
lookback_days | | | 100 |
initiator_helpfulness_by_foreign_thread_comment_frequency(community)
This indicator measures initiator helpfulness by the frequency of comments by the thread's initiator that were posted in threads with a different initiator ('foreign threads').
Parameters:
Name | Type | Description | Default |
---|---|---|---|
community | | | required |
initiator_helpfulness_by_top_commenter_status(community, contributor, k=90)
Calculates whether a thread's initiator has top commenter status. A 'top commenter' has posted more comments than the k-th percentile (default: k=90).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
community | | | required |
contributor | | | required |
k | | | 90 |
idea_popularity
idea_popularity_by_number_of_unique_users_commenting(community)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
community | | | required |
network_position
Metrics using the community's graph object (representation of contributor network).
By level of observation:
contributors
- [contributor_degree][pici.metrics.network.contributor_degree]
- [contributor_centralities][pici.metrics.network.contributor_centralities]
- [contributor_communities][pici.metrics.network.contributor_communities]
co_contributor_centralities(community)
Contributor centralities.
Includes degree centrality, betweenness centrality, and eigenvector centrality.
Using networkx
implementation.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
community | pici.Community | | required |
co_contributor_communities(community, leiden_lib='cdlib')
Find communities within the contributor network.
Uses the weighted Leiden algorithm (Traag et al., 2018) as implemented in cdlib.algorithms.leiden or leidenalg.
Traag, Vincent, Ludo Waltman, and Nees Jan van Eck. From Louvain to Leiden: guaranteeing well-connected communities. arXiv preprint arXiv:1810.08473 (2018).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
leiden_lib | | Which Leiden alg. implementation to use, 'cdlib' or 'leidenalg'. | 'cdlib' |
community | | | required |
Returns:
Name | Type | Description |
---|---|---|
node_communities_map | dict of node:list(communities) | List of communities a contributor belongs to. See https://cdlib.readthedocs.io/en/latest/reference/classes/node_clustering.html. |
co_contributor_degree(community)
Number of contributors each contributor has co-authored with in a thread.
Using implementation of networkx.Graph.degree
.
TODO
document
Parameters:
Name | Type | Description | Default |
---|---|---|---|
community | pici.Community | | required |
initiator_centrality_in_co_contributor_network(community, k=None)
TODO: implement using _initial_post_author_network_metric()
Parameters:
Name | Type | Description | Default |
---|---|---|---|
community | | | required |
k | | | None |
reports
Reports are groups of metrics evaluated for all communities under analysis. See also: building reports.
Reports by level of observation:
community
posts_contributors_per_interval(pici, interval)
Number of contributors and posts for each time interval
.
TODO
- document
- add to TOC
Parameters:
Name | Type | Description | Default |
---|---|---|---|
pici | | | required |
interval | | | required |
Returns:
Name | Type | Description |
---|---|---|
report | | |
status_reputation
initiator_prestige_by_commenter_network_in_deg_centrality(community)
Determines a thread initiator's 'prestige' by their degree centrality in the commenter network at the time of thread creation, i.e., the number of users that have commented on at least one of their threads at that time.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
community | | | required |
number_of_replies_to_topics_initiated_by_thread_initiator(community, ignore_temporal_dependency=True)
Counts the number of replies to topics initiated by the thread's initiator.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
community | | | required |
ignore_temporal_dependency | | Whether to simply count all replies, | True |
pici
Pici
TODO
Add documentation.
Examples:
from communities import OEMCommunityFactory, OSMCommunityFactory, PPCommunityFactory
p = Pici(
communities={
'OpenEnergyMonitor': OEMCommunityFactory,
'OpenStreetMap': OSMCommunityFactory,
'PreciousPlastic': PPCommunityFactory,
},
start='2017-01-01',
end='2017-12-01',
cache_nrows=5000
)
__init__(communities=None, labels=[], cache_dir='cache', cache_nrows=None, start=None, end=None)
Loads communities.
Communities can be loaded from cache or scraped. Loaded data can be restricted either by the number of rows loaded from cache (cache_nrows), or by setting start and end dates (filter on publication dates of posts).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
communities | dict of str: pici.CommunityFactory | Dictionary of communities. Communities are provided as | None |
cache_dir | str | Path to folder that contains cache files. | 'cache' |
cache_nrows | int | Number of rows to load from cache (None (default): load all rows). | None |
start | str | Start-date for filtering posts. String format must be valid input for | None |
end | str | End-date for filtering posts. String format must be valid input for | None |
get_metrics(level=None, returntype=None, unwrapped=False, select_func=set.intersection)
Get all available metrics that are defined for the communities. The select_func parameter defaults to set.intersection, meaning that only those metrics are returned that exist for all communities. Metrics can be filtered by level and returntype.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
level | | | None |
returntype | | | None |
unwrapped | | 'Unwrap' the returned metric functions from their | False |
select_func | | | set.intersection |
Returns:
Type | Description |
---|---|
dict of str:func | metricname: metric |
get_preprocessors(level=None, returntype=None, unwrapped=False, select_func=set.intersection)
Get all available preprocessors that are defined for the communities. The select_func parameter defaults to set.intersection, meaning that only those preprocessors are returned that exist for all communities. Preprocessors can be filtered by level and returntype.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
level | | | None |
returntype | | | None |
unwrapped | | 'Unwrap' the returned preprocessor functions from their | False |
select_func | | | set.intersection |
Returns:
Type | Description |
---|---|
dict of str:func | preprocessorname: preprocessor |
pipelines
Pipelines
_append_preprocessing_results(results)
staticmethod
Appends Series generated by preprocessing to according datalevel objects of a Community.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
results | | Should have format ([(datalevel, Series), (datalevel, Series), ...], Community). | required |
preprocessors
posts
number_of_words(community)
Adds the number of words in each post as int to community.posts.
post_position_in_thread(community)
Adds each post's position in thread (as int, starting with 1) to community.posts.
preprocessed_text(community, n_topics=10)
This preprocessor supplies cleaned text, text statistics (using Textacy) and sentiment statistics (TextBlob). The following columns are added to Community.posts:
- clean
- all_words
- words_no_stop
- n_words_no_stop
- frac_uppercase
- frac_punctuation_marks
- avg_syllables_per_word
- sentiment_polarity
- sentiment_subjectivity
- n_words
- n_chars
- n_long_words
- n_unique_words
- n_syllables
- n_syllables_per_word
- entropy
- ttr
- segmented_ttr
- hdd
- automated_readability_index
- flesch_reading_ease
- smog_index
- coleman_liau_index
- flesch_kincaid_grade_level
- gunning_fog_index
Parameters:
Name | Type | Description | Default |
---|---|---|---|
community |
required |
rounded_date(community, round_dates_to='7D')
Round the post dates according to the specified frequency. If round_dates_to is None, this preprocessor does nothing.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
community | | | required |
round_dates_to | | Frequency to round the initial posts' dates to; see https://pandas.pydata.org/docs/user_guide/timeseries.html | '7D' |
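With pandas, the rounding itself is essentially a one-liner (column names here are hypothetical):

```python
import pandas as pd

posts = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-01-06", "2020-01-09"]),
})
# Round each post date to the nearest 7-day bin and store it as a new column.
posts["rounded_date"] = posts["date"].dt.round("7D")
```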
topics
thread_text(community)
Adds column thread_text to community.topics. Supplies the texts of all posts in a thread as a tuple of strings, in order of post creation date (starting with the initial post).
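A sketch of how such a per-thread tuple can be built with pandas (column names are hypothetical):

```python
import pandas as pd

posts = pd.DataFrame({
    "thread_id": ["t1", "t1", "t2"],
    "date": pd.to_datetime(["2020-01-02", "2020-01-01", "2020-01-03"]),
    "text": ["reply", "initial", "only post"],
})
# Sort by creation date so the initial post comes first, then collect
# each thread's post texts into a tuple.
thread_text = (
    posts.sort_values("date")
    .groupby("thread_id")["text"]
    .apply(tuple)
)
```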
registries
MetricRegistry
Bases: FuncExposer
This class exposes all methods decorated with @metric as
its own methods and passes the community
parameter to them.
PreprocessorRegistry
Bases: FuncExposer
This class exposes all methods decorated with @community_preprocessor as
its own methods and passes the community
parameter to them.
ReportRegistry
Bases: FuncExposer
This class exposes all methods decorated with @report as
its own methods and passes the communities
parameter to them.
reporting
Metric
TODO: add documentation
Report
TODO: add documentation
metric(level, returntype)
A decorator for community metrics.
The parameters level and returntype determine how, and using which level of observation (topics, posts, etc.), the metric's results are represented.
- Only methods using this decorator are available as metrics through pici.Community.metrics.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
level | pici.datatypes.CommunityDataLevel | The metric's data level. Determines to which 'view' of pici.Community the metric's results are appended. | required |
returntype | pici.datatypes.MetricReturnType | Data type of metric's return value. | required |
Returns:
Type | Description |
---|---|
Either the plain metric value, or the value(s) appended to community data. Type determined by returntype. |
preprocessor(level)
A decorator for preprocessors.
report(func)
TODO: add documentation
Parameters:
Name | Type | Description | Default |
---|---|---|---|
func | | | required |
visualizations
plot_lorenz_curves(pici)
Plots the %posts vs %contributors Lorenz curves for all communities.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
pici | | | required |