Merged
Changes from all commits
Commits
98 commits
cceb686
Corrected namespace for Intangible, hyperlinked a definition.
danbri Oct 5, 2018
2742ebf
Merge pull request #9 from google/dev
pradh Oct 10, 2018
a23917a
Merge pull request #11 from google/dev
pradh Oct 11, 2018
520928a
Merge pull request #12 from google/dev
pradh Oct 11, 2018
130d63f
Merge pull request #13 from google/dev
pradh Oct 15, 2018
2b25ff4
Merge pull request #14 from google/dev
shifucun Oct 16, 2018
95e3d62
Merge pull request #3 from google/master
vickitardif Oct 17, 2018
908de10
Added schema.datacommons.org/County.
Oct 17, 2018
eba5a9e
Merge pull request #15 from vholland/master
vickitardif Oct 17, 2018
23ea582
Added documentation for dcid and provenance.
vickitardif Oct 18, 2018
aeac3fe
Merge pull request #16 from vholland/master
shifucun Oct 18, 2018
084d95d
Added documentation for area, timezone, freebaseId, geonamesId, and w…
vickitardif Oct 23, 2018
8bdf77d
Fixed name of geonamesId.
vickitardif Oct 23, 2018
46518c1
Merge pull request #4 from google/master
vickitardif Oct 23, 2018
597c2f1
Merge pull request #18 from vholland/master
danbri Oct 24, 2018
287ac7e
Merge branch 'master' of https://github.com/google/datacommons
danbri Oct 24, 2018
3ecf3ab
Removed 'current' from the definition of timezone.
vickitardif Oct 25, 2018
f86e864
Merge pull request #20 from google/dev
shifucun Nov 1, 2018
650c362
Merge pull request #21 from vholland/master
shifucun Nov 1, 2018
0733216
Fixed the empty column bug (#24)
antaresc Nov 7, 2018
69eb578
Add GNIS property.
panesargoog Nov 14, 2018
2c2fd71
Add fipsId property.
panesargoog Nov 14, 2018
ef6a13d
Use updated caching api
Nov 14, 2018
aebb47e
Merge pull request #26 from shifucun/dev
shifucun Nov 15, 2018
c9edffb
Allow space in column name
Nov 27, 2018
82e91d2
Remove print
Nov 27, 2018
c02405d
Merge pull request #29 from shifucun/col_name
panesargoog Nov 27, 2018
8b33719
Merge branch 'master' of https://github.com/google/datacommons
danbri Dec 13, 2018
eea6820
Fixed <br> to <br/> for xmllint (our rdf parsing is xhtml-based).
danbri Dec 13, 2018
5da23b1
Preparation for use with sdoapp
Dataliberate Jan 3, 2019
311cdf8
tweaks
Dataliberate Jan 3, 2019
b1819c9
Test of changes visibility
RichardWallis Jan 3, 2019
fa8f819
Test changes
RichardWallis Jan 3, 2019
87ac31e
Added config file to repo
Dataliberate Jan 3, 2019
3f7c9cd
Merge branch 'appupdate' of https://github.com/RichardWallis/datacomm…
Dataliberate Jan 3, 2019
63f5e2c
Mods to roduce correct (vocabUri based) RDFa
Dataliberate Jan 18, 2019
236b1d6
Merge branch 'appupdate' of https://github.com/RichardWallis/datacomm…
Dataliberate Jan 18, 2019
866667b
Update datacommons.py
Spaceenter Jan 22, 2019
d422fc4
Merge pull request #31 from Spaceenter/patch-1
Spaceenter Jan 22, 2019
f9c2dfd
Add missing "()" to a query in get_places_in().
Spaceenter Jan 28, 2019
6adeac3
Merge pull request #32 from google/Spaceenter-patch-1
Spaceenter Jan 28, 2019
502b350
Update datacommons client to be compatible with new schema
Jan 29, 2019
75b1015
Merge pull request #33 from shifucun/new_api
shifucun Jan 29, 2019
750401e
Added ability to specify childhoodLocation (#34)
antaresc Jan 30, 2019
d4520a2
Add CensusTract to get_places_in API
Feb 6, 2019
17c7053
Handle error case
Feb 6, 2019
ee459c5
Add comment
Feb 6, 2019
fb315a8
Merge pull request #35 from shifucun/new_api
shifucun Feb 6, 2019
af0903a
added empty examples file for consistancy
Dataliberate Feb 8, 2019
5909632
Adjust config for release
Dataliberate Feb 8, 2019
f1364b0
Added temporary test config
Dataliberate Feb 8, 2019
8708038
Add deploymeny yaml files
Dataliberate Feb 11, 2019
0c6da52
Added draft deployment instructions file
Dataliberate Feb 11, 2019
087e065
Change deployment file from html to md
Dataliberate Feb 11, 2019
8bad00c
Merge pull request #36 from RichardWallis/appupdate
danbri Feb 13, 2019
4a5acab
For observations, use observation_date instead of start_time/end_time
Mar 1, 2019
6bdc413
Use prod client
Mar 1, 2019
7dc9b76
Merge pull request #37 from shifucun/new_api
shifucun Mar 1, 2019
558c720
Use orient split in read and save dataframe so index is not saved in …
Mar 1, 2019
c93104a
Merge pull request #38 from shifucun/new_api
shifucun Mar 1, 2019
1ea55d9
Add measurementMethod in get_observation
Mar 2, 2019
0e62bb8
Merge pull request #39 from shifucun/new_api
shifucun Mar 2, 2019
4c9f59a
Update client library to fit for new mixer string_value
Apr 16, 2019
5b4ef99
Merge pull request #43 from shifucun/new_api
shifucun Apr 16, 2019
4371901
Fix header prefix and date format
Apr 16, 2019
a3f01b4
Merge pull request #44 from shifucun/new_api
shifucun Apr 16, 2019
618d0e4
docstring typos and consistency
tjann Apr 30, 2019
6e4808f
Merge pull request #45 from tjann/patch-1
tjann May 1, 2019
c033482
Reimplemented base API
antaresc May 3, 2019
b806269
Implemented places extension
antaresc May 3, 2019
ff80926
Fixed bug
antaresc May 3, 2019
1432372
Added bio stub
antaresc May 3, 2019
3156473
Finished re-implementing bio extension... n o w t o t e s t.
antaresc May 4, 2019
806a927
Added examples and fixed DCFrame bugs
antaresc May 6, 2019
d0e4233
added populations stub for pop extension
tjann May 6, 2019
c3065d8
implemented get_pop and get_obs in populations API n o w t o t e s t
tjann May 7, 2019
4b49b3b
Merge branch 'feature/api-version-2' of github.com:ACscooter/datacomm…
tjann May 7, 2019
6b2e200
Fixed dangling line in places
antaresc May 7, 2019
6f24e91
BioExtension demo works
antaresc May 7, 2019
45bf9c0
Some tweaks
antaresc May 7, 2019
b388983
Fixed header comment
antaresc May 7, 2019
de19a17
missing comma
tjann May 13, 2019
968c04c
fixed typo self._col_type -> self._col_types
tjann May 13, 2019
e624df1
seed and new col val already have ? append to beg
tjann May 13, 2019
0f48885
fixing populations library and updating infra
tjann May 13, 2019
1273e93
Merge branch 'feature/api-version-2' of github.com:ACscooter/datacomm…
tjann May 13, 2019
d29d07f
similar fix for get_obs for extra ? for colvar
tjann May 13, 2019
cc8b785
get useful prints in test/examples
tjann May 13, 2019
9fce7c5
Merge branch 'feature/api-version-2' of github.com:ACscooter/datacomm…
tjann May 13, 2019
f814b5d
fat fingers on copy and paste
tjann May 13, 2019
698a1b7
places.py self._col_type -> self._col_types
tjann May 13, 2019
61bc795
Implemented draft of weather API extension
antaresc May 29, 2019
232c400
Added weather example
antaresc May 29, 2019
3d0e2b0
Weather API works
antaresc Jun 5, 2019
42b8ff1
Added bio mixer specs and tweaked bio API
antaresc Jun 8, 2019
a7191f8
Fixed places bug
antaresc Jun 13, 2019
8a9f1d0
Merged dev branch
antaresc Jul 8, 2019
6026c75
Added data cleaning helpers
antaresc Jul 8, 2019
35 changes: 3 additions & 32 deletions datacommons/bio.py
@@ -116,7 +116,7 @@ def get_experiments(self, new_col_name, **kwargs):
    # Specify select and process functions to filter for biosample class and
    # terms. This enforces the paired-ness of term and class
    select = select_biosample_summary('?bioClass', '?bioTerm', classes, terms)
-    process = delete_column('?bioClass', '?bioTerm')
+    process = utils.delete_column('?bioClass', '?bioTerm')
    if 'lab_name' in kwargs:
      lab_names = ['"{}"'.format(name) for name in kwargs['lab_name']]
      query.add_constraint(new_col_var, 'lab', '?labNode')
@@ -291,9 +291,9 @@ def get_bed_lines(self, seed_col_name, prop_info=DEFAULT_BEDLINE_PROPS, **kwargs
    # If filters were specified, compose the filters and add a post processor if
    # necessary.
    if select_funcs:
-      select = compose_select(*select_funcs)
+      select = utils.compose_select(*select_funcs)
    if drop_cols:
-      process = delete_column(*drop_cols)
+      process = utils.delete_column(*drop_cols)

    # Perform the query and merge
    new_frame = DCFrame(datalog_query=query,
@@ -366,32 +366,3 @@ def select(row):
      return True
    return False
  return select
-
-def compose_select(*select_funcs):
-  """ Returns a filter function composed of the given selectors.
-
-  Args:
-    select_funcs: Functions to compose.
-
-  Returns:
-    A filter function which returns True iff all select_funcs return True.
-  """
-  def select(row):
-    return all(select_func(row) for select_func in select_funcs)
-  return select
-
-def delete_column(*cols):
-  """ Returns a function that deletes the given column from a frame.
-
-  Args:
-    cols: Columns to delete from the data frame.
-
-  Returns:
-    A function that deletes columns in the given Pandas DataFrame.
-  """
-  def process(pd_frame):
-    for col in cols:
-      if col in pd_frame:
-        pd_frame = pd_frame.drop(col, axis=1)
-    return pd_frame
-  return process
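The two helpers deleted here are not gone; the PR moves them into `datacommons/utils.py` (see that file's diff below) so `bio.py` calls them as `utils.compose_select` and `utils.delete_column`. As a minimal sketch of what the composed selector does — the selector functions, row dicts, and field values below are made up for illustration, only the `compose_select` body comes from this PR:

```python
def compose_select(*select_funcs):
    # Same body as the helper moved into datacommons/utils.py: the
    # composed filter keeps a row iff every selector returns True.
    def select(row):
        return all(select_func(row) for select_func in select_funcs)
    return select

# Hypothetical selectors over query-result rows keyed by query variable.
def is_biosample(row):
    return row.get('?bioClass') == 'Biosample'

def has_term(row):
    return row.get('?bioTerm') is not None

keep = compose_select(is_biosample, has_term)

rows = [
    {'?bioClass': 'Biosample', '?bioTerm': 'liver'},
    {'?bioClass': 'Biosample', '?bioTerm': None},
    {'?bioClass': 'Lab', '?bioTerm': 'liver'},
]
print([r for r in rows if keep(r)])  # only the first row passes both selectors
```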
30 changes: 17 additions & 13 deletions datacommons/datacommons.py
@@ -113,7 +113,7 @@ def query(self, datalog_query, rows=100):
      RuntimeError: some problem with executing query (hint in the string)
    """
    assert self._inited, 'Initialization was unsuccessful, cannot execute Query'

    # Append the options
    options = {}
    if self._db_path:
@@ -123,13 +123,13 @@ def query(self, datalog_query, rows=100):

    # Send the query to the DataCommons query service
    try:
-      response = self._service.query_table(body={
-          'query': datalog_query,
-          'options': options
-      }).execute()
+      response = self._service.query_table(body={
+        'query': datalog_query,
+        'options': options
+      }).execute()
    except Exception as e:
-      msg = 'Failed to execute query:\n Query: {}\n Error: {}'.format(datalog_query, e)
-      raise RuntimeError(msg)
+      msg = 'Failed to execute query:\n Query: {}\n Error: {}'.format(datalog_query, e)
+      raise RuntimeError(msg)

    # Format and return the result as a DCFrame
    header = response.get('header', [])
@@ -307,17 +307,21 @@ def types(self):
    """
    return self._col_types

-  def pandas(self, col_names=None):
+  def pandas(self, col_names=None, ignore_populations=False):
    """ Returns a copy of the data in this view as a Pandas DataFrame.

    Args:
      col_names: An optional list specifying which columns to extract.
+      ignore_populations: Ignores all columns that have type
+        StatisticalPopulation. col_names takes precedence over this argument

    Returns: A deep copy of the underlying Pandas DataFrame.
    """
-    if col_names:
-      return self._dataframe[col_names].copy()
-    return self._dataframe.copy()
+    if not col_names:
+      col_names = list(self._dataframe)
+    if ignore_populations:
+      col_names = list(filter(lambda name: self._col_types[name] != 'StatisticalPopulation', col_names))
+    return self._dataframe[col_names].copy()

  def csv(self, col_names=None):
    """ Returns the data in this view as a CSV string.
@@ -329,7 +333,7 @@ def csv(self, col_names=None):
      The DataFrame exported as a CSV string.
    """
    if col_names:
-        return self._dataframe[col_names].to_csv(index=False)
+      return self._dataframe[col_names].to_csv(index=False)
    return self._dataframe.to_csv(index=False)

  def tsv(self, col_names=None):
@@ -342,7 +346,7 @@ def tsv(self, col_names=None):
      The DataFrame exported as a TSV string.
    """
    if col_names:
-        return self._dataframe[col_names].to_csv(index=False, sep='\t')
+      return self._dataframe[col_names].to_csv(index=False, sep='\t')
    return self._dataframe.to_csv(index=False, sep='\t')

def rename(self, labels):
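The new `ignore_populations` flag in `DCFrame.pandas()` can be exercised outside the client with a stand-in frame. In this sketch the column names, dcids, and the `col_types` map are hypothetical stand-ins for a DCFrame's internal `_dataframe` and `_col_types` state:

```python
import pandas as pd

# Stand-ins for DCFrame._dataframe and DCFrame._col_types (hypothetical data).
dataframe = pd.DataFrame({
    'state_dcid': ['geoId/06', 'geoId/21'],
    'state_population': ['dc/p/abc', 'dc/p/def'],
    'state_name': ['California', 'Kentucky'],
})
col_types = {
    'state_dcid': 'State',
    'state_population': 'StatisticalPopulation',
    'state_name': 'Text',
}

def pandas_view(col_names=None, ignore_populations=False):
    # Mirrors the patched pandas(): default to all columns, then drop any
    # column whose recorded type is StatisticalPopulation when asked.
    if not col_names:
        col_names = list(dataframe)
    if ignore_populations:
        col_names = [n for n in col_names
                     if col_types[n] != 'StatisticalPopulation']
    return dataframe[col_names].copy()

print(list(pandas_view(ignore_populations=True)))
# The StatisticalPopulation column is filtered out of the view.
```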
61 changes: 14 additions & 47 deletions datacommons/examples/analysis_populations.py
@@ -41,91 +41,58 @@ def print_pandas(example_num, df):
  print('\n')

def main():

  frame_1 = datacommons.DCFrame()          # establish generic df
  frame_1 = PopulationsExtension(frame_1)  # add population features to df

  # Start by initializing a column of three US states: California, Kentucky, and
  # Maryland.
  frame_1.add_column('state_dcid', 'State', ['geoId/06', 'geoId/21', 'geoId/24'])
-  print(frame_1.pandas())
+  print_pandas(1, frame_1.pandas())

  # Name is an outgoing property of the State. We can call expand to populate a
  # column 'state_name' with names of states corresponding to dcids in the
  # 'state_dcid' column.
  frame_1.expand('name', 'state_dcid', 'state_name')
-  print(frame_1.pandas())

  # Get populations for state
  frame_1.get_populations(
      seed_col_name='state_dcid',
      new_col_name='state_population',
      population_type='Person',
      rows=100)
-  print(frame_1.pandas())
+  print_pandas(2, frame_1.pandas())

  frame_1.get_populations(
      seed_col_name='state_dcid',
-      new_col_name='state_18_24_years_population',
+      new_col_name='state_male_population',
      population_type='Person',
      rows=100,
-      age='USC/18To24Years')
-  print(frame_1.pandas())
+      gender='Male')
+  print_pandas(3, frame_1.pandas())

  frame_1.get_populations(
      seed_col_name='state_dcid',
-      new_col_name='state_male_population',
+      new_col_name='state_female_population',
      population_type='Person',
      rows=100,
-      gender='Male')
-  print(frame_1.pandas())
+      gender='Female')
+  print_pandas(3, frame_1.pandas())

  # Get observations on state populations
  frame_1.get_observations(
      seed_col_name='state_population',
      new_col_name='state_person_2016_count',
      observation_date='2016',
      measured_property='count')
+  print_pandas(4, frame_1.pandas())

-  # Add 3 counties contained in each state
-  frame_1.expand(
-      'containedInPlace',
-      'state_dcid',
-      'county_dcid',
-      new_col_type='County',
-      outgoing=False,
-      rows=3)
-  print(frame_1.pandas())
-
-  # Get populations for counties
-  frame_1.get_populations(
-      seed_col_name='county_dcid',
-      new_col_name='county_population',
-      population_type='Person',
-      rows=100)
-  print(frame_1.pandas())
+  # To ignore the population columns...
+  print_pandas(5, frame_1.pandas(ignore_populations=True))

-  frame_1.get_populations(
-      seed_col_name='county_dcid',
-      new_col_name='county_18_24_years_population',
-      population_type='Person',
-      rows=100,
-      age='USC/18To24Years')
-  print(frame_1.pandas())
+  # Print the max population count
+  print('Max population count...')
+  print(frame_1.pandas()['state_person_2016_count'].max())

-  frame_1.get_populations(
-      seed_col_name='county_dcid',
-      new_col_name='county_male_population',
-      population_type='Person',
-      rows=100,
-      gender='Male')
-  print(frame_1.pandas())
-
-  # Get observations on county populations
-  frame_1.get_observations(
-      seed_col_name='county_population',
-      new_col_name='county_person_2016_count',
-      observation_date='2016',
-      measured_property='count')

if __name__ == '__main__':
  main()
21 changes: 15 additions & 6 deletions datacommons/populations.py
@@ -109,6 +109,7 @@ def get_observations(self,
      observation_date,
      measured_property,
      stats_type=None,
+      clean_data=True,
      rows=100):
    """Create a new column with values for an observation of the given property.
    The current pandas dataframe should include a column containing population
@@ -122,6 +123,7 @@ def get_observations(self,
      observations_date: The date of the observation (in 'YYY-mm-dd' form).
      measured_property: observation measured property.
      stats_type: Statistical type like "Median"
+      clean_data: A flag to convert to numerical types and filter out any NaNs.
      rows: The maximum number of rows returned by the query results.

    Raises:
@@ -169,14 +171,21 @@ def get_observations(self,
    query.add_constraint('?o', 'observationDate', '\"{}\"'.format(observation_date))
    query.add_constraint('?o', 'measuredProperty', measured_property)
    query.add_constraint('?o', '{}Value'.format(stats_type), new_col_var)
-    measurementMethod = None
+    measurement_method = None
    if measured_property == 'prevalence':
-      measurementMethod = 'CDC_CrudePrevalence'
+      measurement_method = 'CDC_CrudePrevalence'
    elif measured_property == 'unemploymentRate':
-      measurementMethod = 'BLSSeasonallyUnadjusted'
-    if measurementMethod:
-      query.add_constraint('?o', 'measurementMethod', measurementMethod)
+      measurement_method = 'BLSSeasonallyUnadjusted'
+    if measurement_method:
+      query.add_constraint('?o', 'measurementMethod', measurement_method)
+
+    # Check if data should be cleaned
+    clean_func = None
+    if clean_data:
+      type_func = utils.convert_type(new_col_var, 'float')
+      nan_func = utils.drop_nan(new_col_var)
+      clean_func = utils.compose_process(type_func, nan_func)

    # Perform the query and merge the results
-    new_frame = DCFrame(datalog_query=query, labels=labels, type_hint=type_hint, rows=rows)
+    new_frame = DCFrame(datalog_query=query, labels=labels, process=clean_func, type_hint=type_hint, rows=rows)
    self.merge(new_frame)
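With `clean_data=True`, the observation column is piped through `utils.convert_type` and `utils.drop_nan`, chained by `utils.compose_process`. A self-contained sketch of that pipeline using the helper bodies from this PR's `utils.py` diff; the observation column name and values are hypothetical:

```python
import pandas as pd

def convert_type(col_names, dtype):
    # Body as added in utils.py; note it coerces with pd.to_numeric and
    # does not consult the dtype argument in this version.
    if isinstance(col_names, str):
        col_names = [col_names]
    def process(pd_frame):
        for name in col_names:
            pd_frame[name] = pd.to_numeric(pd_frame[name])
        return pd_frame
    return process

def drop_nan(col_names):
    # Drops rows with NaN in any of the named columns.
    if isinstance(col_names, str):
        col_names = [col_names]
    def process(pd_frame):
        return pd_frame.dropna(subset=col_names)
    return process

def compose_process(*process_funcs):
    # Applies each process function in order.
    def process(pd_frame):
        for process_func in process_funcs:
            pd_frame = process_func(pd_frame)
        return pd_frame
    return process

# Hypothetical observation column: one value missing from the query result.
frame = pd.DataFrame({'?observationValue': ['4436974', None, '39250017']})
clean_func = compose_process(
    convert_type('?observationValue', 'float'),
    drop_nan('?observationValue'))
cleaned = clean_func(frame)
print(cleaned['?observationValue'].tolist())
```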
79 changes: 79 additions & 0 deletions datacommons/utils.py
@@ -18,6 +18,8 @@

from collections import OrderedDict

+import pandas as pd
+

class MeasuredValue:
""" An enumeration of valid measured values in the DataCommons graph.
@@ -89,3 +91,80 @@ def add_constraint(self, sub, pred, obj):
    if pred not in self._constraints[sub]:
      self._constraints[sub][pred] = []
    self._constraints[sub][pred].append(obj)
+
+
+# ------------------------ SELECT AND PROCESS HELPERS -------------------------
+
+
+def convert_type(col_names, dtype):
+  """ Converts values in a given column to the given type.
+
+  Args:
+    col_names: The column or columns to convert
+    dtype: Data type or a dictionary from column name to data type.
+
+  Returns: A process function that converts the column to a given type.
+  """
+  if isinstance(col_names, str):
+    col_names = [col_names]
+  def process(pd_frame):
+    for name in col_names:
+      pd_frame[name] = pd.to_numeric(pd_frame[name])
+    return pd_frame
+  return process
+
+def drop_nan(col_names):
+  """ Drops rows containing NAN as a value in columns in col_names.
+
+  Args:
+    col_names: single column name or a list of column names.
+  """
+  if isinstance(col_names, str):
+    col_names = [col_names]
+  def process(pd_frame):
+    return pd_frame.dropna(subset=col_names)
+  return process
+
+def delete_column(*cols):
+  """ Returns a function that deletes the given column from a frame.
+
+  Args:
+    cols: Columns to delete from the data frame.
+
+  Returns:
+    A function that deletes columns in the given Pandas DataFrame.
+  """
+  def process(pd_frame):
+    for col in cols:
+      if col in pd_frame:
+        pd_frame = pd_frame.drop(col, axis=1)
+    return pd_frame
+  return process
+
+def compose_select(*select_funcs):
+  """ Returns a filter function composed of the given selectors.
+
+  Args:
+    select_funcs: Functions to compose.
+
+  Returns:
+    A filter function which returns True iff all select_funcs return True.
+  """
+  def select(row):
+    return all(select_func(row) for select_func in select_funcs)
+  return select
+
+def compose_process(*process_funcs):
+  """ Returns a process function composed of the given functions.
+
+  Args:
+    process_funcs: Functions to compose.
+
+  Returns:
+    A process function which performs each function in the order given.
+  """
+  def process(pd_frame):
+    for process_func in process_funcs:
+      pd_frame = process_func(pd_frame)
+    return pd_frame
+  return process
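A quick check of `delete_column` as added above; the frame columns here are hypothetical. Dropping an absent column is a quiet no-op, since the helper tests membership before calling `drop`, and because `drop` returns a new frame the caller's original is left untouched:

```python
import pandas as pd

def delete_column(*cols):
    # Same body as the helper above: drop each named column if present.
    def process(pd_frame):
        for col in cols:
            if col in pd_frame:
                pd_frame = pd_frame.drop(col, axis=1)
        return pd_frame
    return process

# Hypothetical query-result frame with two intermediate columns.
frame = pd.DataFrame({
    '?bioClass': ['Biosample'],
    '?bioTerm': ['liver'],
    'name': ['sample-1'],
})
process = delete_column('?bioClass', '?bioTerm', '?notAColumn')
print(list(process(frame)))  # '?notAColumn' is absent and skipped quietly
```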