Could you extend the documentation to include an example of collecting data from a table, please? #30

sxwebster opened this issue Jan 14, 2025 · 2 comments

@sxwebster

Have you searched if there is an existing feature request for this?

  • I have searched the existing requests

Feature description

Great work on scrapling!

I would find it particularly useful if you could give an example in the readme of how to scrape from a table, especially if that table doesn't necessarily have an ID or class.

@sxwebster sxwebster added the enhancement label Jan 14, 2025
@D4Vinci
Owner

D4Vinci commented Jan 14, 2025

Hi mate, thanks for this!
I have been working on a documentation website that will be more detailed and have more examples.
In your case, there are a lot of things you could do. For example, if it's a traditional HTML table, you can use the table/tr/td tags to get what you want.
Otherwise, another easy way is to extract a row by its text and then use the find_similar() method to get the rest of the rows, as long as the content doesn't change.

Let me know if that helps, or show me your exact example and I will tell you what to do with it.
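
Roughly, a sketch of both approaches (untested; the URL and the row text are placeholders, and page is the object Fetcher().get() returns):

from scrapling import Fetcher

page = Fetcher().get('https://example.com/page-with-a-table')  # placeholder URL

# Traditional HTML table: walk the table/tr/td tags directly
rows = [
    [cell.text for cell in row.css('td, th')]
    for row in page.css('table tr')
]

# Or: locate one row by the text of one of its cells (placeholder text
# here), then use find_similar() to get the other rows that share its
# location and attributes
cell = page.find_by_text('some cell text', first_match=True)
other_rows = cell.parent.find_similar()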

@sxwebster
Author

sxwebster commented Jan 15, 2025

Hey mate, thanks for the quick reply! You've actually caught me posting a comment just as I'm about to head off travelling for a week and a bit, so my replies might be a little slow, sorry!

The following is an example of what I'm doing using a combination of scrapling and bs4. The code is in a lazy state since I'm still just stuffing around in a Jupyter notebook before tidying it up, so please accept my apologies for that.

I've also broken up the page URL in the first couple of lines just so it's a little less findable in search.

The output is essentially the table on the above page as a pandas DataFrame, noting that I've also had to account for colspans.

from scrapling import Fetcher
from bs4 import BeautifulSoup
import pandas as pd
import re
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

# NOTE: the URL is deliberately broken up with spaces (see the note above);
# rejoin it before running
page = Fetcher().get('http://www.b    om.g    ov.au/cgi-bin/wra    p_fwo.pl?IDQ    60005.html', stealthy_headers=True, follow_redirects=True)


def parse_time_day_to_datetime(value, today_aest=None):
    """
    Given a string like "11.49pm Tue", return a datetime in AEST for the *current week*.
    Search backward up to 6 days if the day-of-week doesn't match "today".
    """

    if not value or pd.isnull(value):
        return pd.NaT  # or None

    # If "today_aest" wasn't provided, use now in AEST:
    if today_aest is None:
        tz = ZoneInfo("Australia/Brisbane")
        today_aest = datetime.now(tz=tz)

    # 1) Extract the time portion and the day name
    # Typical format: "HH.MM(am|pm) DayName"
    # e.g. "11.49pm Tue", "6.29am Wed"
    match = re.match(r"^(\d{1,2})\.(\d{1,2})(am|pm)\s+(\w+)$", value.strip(), re.IGNORECASE)
    if not match:
        # If it doesn't match our expected pattern, return NaT or None
        return pd.NaT

    hour_str, minute_str, ampm, day_str = match.groups()
    hour = int(hour_str)
    minute = int(minute_str)
    ampm = ampm.lower()  # 'am' or 'pm'

    # Convert to 24-hour format
    if ampm == 'pm' and hour < 12:
        hour += 12
    elif ampm == 'am' and hour == 12:
        hour = 0

    # 2) Figure out which calendar date in the last 7 days has the correct day name
    # Let's define a day-of-week name map, matching strftime("%a") output: Mon, Tue, Wed, Thu, Fri, Sat, Sun

    tz = ZoneInfo("Australia/Brisbane")
    # We'll consider 'today' as the day we run the script, in AEST.
    # Then go from 0 to 6 days ago to find a date whose .strftime("%a") = day_str
    for offset in range(7):
        candidate = today_aest - timedelta(days=offset)
        if candidate.strftime("%a") == day_str.title():  # e.g. "Wed" or "Tue"
            # Found the matching day
            # Now replace hour/minute to form the final reading datetime
            reading_dt = datetime(
                candidate.year,
                candidate.month,
                candidate.day,
                hour,
                minute,
                tzinfo=tz
            )
            return reading_dt

    # If we didn't find any match (unlikely if data is only 1-5 days old), fallback
    return pd.NaT

# Convert the fetched page to an HTML string so BeautifulSoup can parse it
page_str = str(page)

page_soup = BeautifulSoup(page_str, "html.parser")

# Find the table (assuming there's only one table in the HTML)
table = page_soup.find("table")

all_rows = []  # will hold lists of cell values

for tr in table.find_all("tr"):
    # Gather all cells (td/th) in this row
    cells = tr.find_all(["td", "th"])
    
    # Skip row if any cell has a colspan
    # (You could also check for rowspan if needed)
    skip_row = any(cell.has_attr("colspan") for cell in cells)
    if skip_row:
        continue

    # Extract the text from each cell in the row
    row_data = [cell.get_text(strip=True) for cell in cells]
    
    # The table has 7 columns, so only keep rows that have 7 cells
    if len(row_data) == 7:
        all_rows.append(row_data)

    # -------------------------------------------------------------------------
    # 1) Look for the link that contains the word "plot" (case-insensitive)
    # -------------------------------------------------------------------------
    plot_link_tag = tr.find("a", string=lambda text: text and "plot" in text.lower())
    
    # If found, parse the link's href to extract your IDQ number and second number
    if plot_link_tag and plot_link_tag.has_attr("href"):
        href = plot_link_tag["href"]  # e.g. http://www.bom.gov.au/fwo/IDQ65388/IDQ65388.540612.plt.shtml
        
        # Get the filename after the last slash, e.g. "IDQ65388.540612.plt.shtml"
        filename = href.split("/")[-1]
        
        # Use a regular expression to capture the part before the first '.' and the next part
        # Pattern: ^(.*?)\.(.*?)\.plt\.shtml$
        match = re.match(r'^(.*?)\.(.*?)\.plt\.shtml$', filename)
        if match:
            idq_number = match.group(1)     # e.g. "IDQ65388"
            second_number = match.group(2)  # e.g. "540612"
        else:
            idq_number = None
            second_number = None
    else:
        # If there is no link or no href attribute
        idq_number = None
        second_number = None
    
    # -------------------------------------------------------------------------
    # 2) Append these two new values to row_data. Because row_data is the same
    #    list object that was just stored in all_rows, this also extends the
    #    stored row to 9 columns.
    # -------------------------------------------------------------------------
    row_data.append(idq_number)
    row_data.append(second_number)


# Now `all_rows` has 7 + 2 = 9 columns total:
#   original 7 columns + IDQ_number + second_number

# The first row in all_rows is the header row: the original 7 column
# headers plus the two None placeholders appended above
header = all_rows[0]

# Amend column titles to fill in missing values
header[7] = "IDQ_Number"
header[8] = "Station_ID"

# The first row is the header; all_rows[0] is the same object as header,
# so the renamed columns above carry through
df = pd.DataFrame(all_rows[1:], columns=header)

# Remove any instances of "^" from the Height column; regex=False so "^" is
# treated literally rather than as a regex anchor
df['Height'] = df['Height'].str.replace('^', '', regex=False)

# convert height column to float
df['Height'] = df['Height'].astype(float)

df["ReadingDateTime"] = df["Time/Day"].apply(parse_time_day_to_datetime)
df["ReadingDateTime"] = pd.to_datetime(df["ReadingDateTime"])

# Drop the 'Recent Data' column in place
df.drop(columns=['Recent Data'], inplace=True)

# Replace any spaces in column titles with underscores
df.columns = df.columns.str.replace(' ', '_')

# Replace the "/" in the Time/Day column title with an underscore
df.columns = df.columns.str.replace('/', '_')

print(df)

# Export to CSV
df.to_csv('flood.csv', index=False)
